Opus bitext and monolingual data


If you use OpusFilter in your research, please cite our ACL 2020 paper.

The script takes CONFIG, the path to the configuration file, as its argument. It runs the steps one by one and stops when the final step has been processed (if no exceptions were raised). There are options for setting the last step to run (--last) and for running only a single step. In the latter case, make sure that all input files for the step already exist. The first step has number 1, and -1 points to the last step, -2 to the second to last, and so on.

By default, existing output files will be re-used, and the steps that produce them are skipped. The --overwrite option will force overwrite.

The valid options for the common section include:

output_directory for setting where to write the output files. If it is not set, the current working directory is used.

chunksize for changing the default chunk size option for filter (with filterfalse option) and score steps. Increasing the value from the default 100000 may speed up things at the cost of increased memory use.

Each step in steps is a dictionary (mapping) with two keys: type and parameters. Type is a string that defines the function that should be run, and parameters is a dictionary with keys that depend on the function; at minimum the output files are defined there.

Examples

A very simple configuration file is one that downloads a parallel corpus from OPUS.
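The configuration layout and the step numbering described here can be sketched in Python. This is only an illustration under stated assumptions, not OpusFilter's own code: the configuration is written as a Python dict instead of a file, the step type names and the "outputs" parameter key are hypothetical placeholders, and check_steps and resolve_step are helper names invented for this sketch.

```python
# Hypothetical configuration mirroring the "common" and "steps" sections.
# The step types and parameter keys are placeholders, not real OpusFilter names.
config = {
    "common": {"output_directory": "work", "chunksize": 100000},
    "steps": [
        {"type": "download_step", "parameters": {"outputs": ["raw.src", "raw.tgt"]}},
        {"type": "filter_step", "parameters": {"outputs": ["clean.src", "clean.tgt"]}},
        {"type": "score_step", "parameters": {"outputs": ["scores.jsonl"]}},
    ],
}


def check_steps(steps):
    """Verify that every step is a mapping with the keys 'type' and 'parameters'."""
    for number, step in enumerate(steps, start=1):
        if not isinstance(step, dict) or set(step) != {"type", "parameters"}:
            raise ValueError(f"step {number} must have keys 'type' and 'parameters'")
        if not isinstance(step["type"], str):
            raise ValueError(f"step {number}: 'type' must be a string")
        if not isinstance(step["parameters"], dict):
            raise ValueError(f"step {number}: 'parameters' must be a dictionary")


def resolve_step(number, total):
    """Map a step number (1-based, or negative counting from the end) to a list index."""
    if number == 0 or abs(number) > total:
        raise ValueError(f"step number {number} is out of range for {total} steps")
    return number - 1 if number > 0 else total + number


check_steps(config["steps"])
last = config["steps"][resolve_step(-1, len(config["steps"]))]
print(last["type"])
```

With three steps, number 1 resolves to the first list entry and -1 to the third, matching the numbering rule described above.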

#OPUS BITEXT AND MONOLINGUAL DATA INSTALL#

See setup.py for possible version requirements.

Optional libraries and tools

For Chinese tokenization (word segmentation), you can use an optional library that is installed automatically with pip when the corresponding extras are included.

For Japanese tokenization (word segmentation), you can likewise use an optional library.

For using sentence embeddings filters, you need to install an additional package. It can be installed automatically with pip by including the extras, and it requires a number of additional libraries, including PyTorch and jieba.

For using n-gram language model filters, you need to install VariKN. Add the library files compiled to build/lib/python to your Python library path (e.g. by setting the PYTHONPATH environment variable).

For using word alignment filters, you need to install eflomal. Set the environment variable EFLOMAL_PATH to eflomal's root directory, which contains the Python scripts align.py and makepriors.py. You will need Cython to install the Python interface to eflomal.
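Before running word alignment filters, it may help to check that EFLOMAL_PATH points at a directory containing the two scripts named above. The following is a minimal sketch of such a check. The function name missing_eflomal_scripts is hypothetical and not part of OpusFilter or eflomal.

```python
import os
from pathlib import Path

# Scripts that the text says live in eflomal's root directory.
REQUIRED_SCRIPTS = ("align.py", "makepriors.py")


def missing_eflomal_scripts(env=None):
    """Return the required scripts that are not found under EFLOMAL_PATH.

    Raises KeyError if EFLOMAL_PATH is not set in the given environment.
    """
    env = os.environ if env is None else env
    root = Path(env["EFLOMAL_PATH"])
    return [name for name in REQUIRED_SCRIPTS if not (root / name).is_file()]


if __name__ == "__main__":
    try:
        missing = missing_eflomal_scripts()
    except KeyError:
        print("EFLOMAL_PATH is not set")
    else:
        print("missing scripts:", missing or "none")
```

An empty result means both scripts were found. A KeyError means the variable still has to be set.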

#OPUS BITEXT AND MONOLINGUAL DATA MANUAL#

Note that not all required libraries are available to install via PyPI. On Linux, installation should work directly for Python versions from 3.6 to 3.8, but with Python 3.9 the fast-mosestokenizer library currently requires a manual install.

Special character and similarity filters.

Script and language identification filters.

OpusFilter has been presented in ACL 2020 system demonstrations.

A changelog is available in docs/CHANGELOG.md.

Extendable with your own filters written in Python.

Filters based on language identification, word alignment, n-gram language models, and multilingual sentence embeddings.

Memory-efficient processing of large files.

Implementations for many common text file operations on parallel files.

Simple downloading of parallel corpora from OPUS with OpusTools.

Corpus preprocessing pipelines configured with YAML.
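To give a feel for what a filter written in Python might look like, here is a standalone sketch. It does not use OpusFilter's actual base class or interface, which this text does not describe. The class name LengthRatioFilter and the method names score, accept, and filter are assumptions made for illustration. The filter works lazily on an iterable of pairs, so a large file never has to be held in memory at once.

```python
class LengthRatioFilter:
    """Hypothetical filter that rejects pairs whose lengths differ too much."""

    def __init__(self, threshold=3.0):
        # Maximum accepted ratio of longer to shorter segment length.
        self.threshold = threshold

    def score(self, pairs):
        """Lazily yield the longer/shorter character-length ratio of each pair."""
        for source, target in pairs:
            shorter, longer = sorted((len(source), len(target)))
            yield longer / shorter if shorter else float("inf")

    def accept(self, score):
        """Accept a pair when its ratio does not exceed the threshold."""
        return score <= self.threshold

    def filter(self, pairs):
        """Lazily yield only the accepted (source, target) pairs."""
        for pair in pairs:
            pair_score = next(self.score([pair]))
            if self.accept(pair_score):
                yield pair


if __name__ == "__main__":
    pairs = [("hello", "hei"), ("a", "a very long sentence"), ("", "x")]
    for source, target in LengthRatioFilter(threshold=2.0).filter(pairs):
        print(source, "|", target)
```

Separating scoring from the accept decision keeps the threshold adjustable without recomputing scores. Whether OpusFilter's own filters follow this split should be checked against its documentation.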

OpusFilter is built around the OPUS corpus collection (Tiedemann, 2012), but it can be used with other corpora as well.

#OPUS BITEXT AND MONOLINGUAL DATA DOWNLOAD#

OpusFilter is a tool for filtering and combining parallel corpora. It uses the OpusTools library (Aulamo et al., 2020) to download data from OPUS.