- #OPUS BITEXT AND MONOLINGUAL DATA INSTALL#
- #OPUS BITEXT AND MONOLINGUAL DATA MANUAL#
- #OPUS BITEXT AND MONOLINGUAL DATA DOWNLOAD#
If you use OpusFilter in your research, please cite our ACL 2020 paper.

On the command line, CONFIG is the path to the configuration file. The script runs the steps one by one and stops when the final step has been processed (if no exceptions were raised). There are options for setting the last step to run (--last) and for running only a single step; in the latter case, make sure that all input files for the step already exist. The first step has number 1, and -1 points to the last step, -2 to the second to last, and so on.

By default, existing output files will be re-used, and the steps that produce them are skipped. The --overwrite option forces existing output files to be overwritten.

The valid options for the common section include:
- output_directory for setting where to write the output files. If it is not set, the current working directory is used.
- chunksize for changing the default chunk size option for the filter (with the filterfalse option) and score steps. Increasing the value from the default 100000 may speed things up at the cost of increased memory use.

Each step in steps is a dictionary (mapping) with two keys: type and parameters. type is a string that defines the function that should be run, and parameters is a dictionary with keys that depend on the function; at minimum the output files are defined there.

Examples

A very simple configuration file is one that downloads a parallel corpus.
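As a rough illustration of the configuration structure and step numbering described above, the following Python sketch represents a configuration as a plain dictionary, checks that each step has the two required keys, and maps a step number (1-based, or negative counting from the end) to a list index. This is not OpusFilter's own code: the function names `resolve_step` and `check_config`, the step types, and the parameter names are all hypothetical placeholders.

```python
def resolve_step(number, total):
    """Map a 1-based or negative step number to a 0-based list index."""
    if number == 0 or abs(number) > total:
        raise ValueError(f"step {number} is out of range for {total} steps")
    # 1 is the first step; -1 is the last step, -2 the second to last.
    return number - 1 if number > 0 else total + number


def check_config(config):
    """Check that every step is a mapping with 'type' and 'parameters' keys."""
    for position, step in enumerate(config["steps"], start=1):
        missing = {"type", "parameters"} - set(step)
        if missing:
            raise ValueError(f"step {position} lacks keys: {sorted(missing)}")
    return True


# Hypothetical configuration: the step types and parameter names below are
# placeholders for illustration, not real OpusFilter functions.
config = {
    "common": {"output_directory": "work", "chunksize": 100000},
    "steps": [
        {"type": "example_download", "parameters": {"outputs": ["a.src", "a.tgt"]}},
        {"type": "example_filter", "parameters": {"outputs": ["b.src", "b.tgt"]}},
    ],
}

check_config(config)
last_index = resolve_step(-1, len(config["steps"]))
```

Here `resolve_step(-1, ...)` selects the final entry of `steps`, mirroring the rule that -1 points to the last step.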
#OPUS BITEXT AND MONOLINGUAL DATA INSTALL#
See setup.py for possible version requirements.

Optional libraries and tools

Chinese and Japanese tokenization (word segmentation) each rely on an optional library. The one for Chinese can be installed automatically with pip by including the appropriate extras.

Sentence embeddings filters need an additional package, which can also be installed automatically with pip by including the appropriate extras. It requires a number of additional libraries, including PyTorch and jieba.

For using n-gram language model filters, you need to install VariKN. Add the library files compiled to build/lib/python to your Python library path (e.g. by setting the PYTHONPATH environment variable).

For using word alignment filters, you need to install eflomal and set the environment variable EFLOMAL_PATH to eflomal's root directory, which contains the Python scripts align.py and makepriors.py. Note that you will need Cython to install the Python interface to eflomal.
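Before running word alignment filters, it can help to confirm that EFLOMAL_PATH points to a directory containing the two scripts named above. The following is a minimal sketch of such a check; the function name `eflomal_scripts_missing` is hypothetical and is not part of OpusFilter or eflomal.

```python
import os


def eflomal_scripts_missing(environ=None):
    """Return the expected eflomal scripts not found under EFLOMAL_PATH.

    Returns None when EFLOMAL_PATH is not set at all.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("EFLOMAL_PATH")
    if root is None:
        return None
    expected = ("align.py", "makepriors.py")
    return [name for name in expected
            if not os.path.isfile(os.path.join(root, name))]


missing = eflomal_scripts_missing()
if missing is None:
    print("EFLOMAL_PATH is not set")
elif missing:
    print("Missing under EFLOMAL_PATH:", ", ".join(missing))
else:
    print("Both eflomal scripts were found")
```

An empty list means both scripts were found; the check says nothing about whether the Python interface itself has been built.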
#OPUS BITEXT AND MONOLINGUAL DATA MANUAL#
Note that not all required libraries are available to install via PyPI. On Linux, it should work directly for Python versions from 3.6 to 3.8, but with Python 3.9 the fast-mosestokenizer library currently requires a manual install.
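The version ranges mentioned above can be turned into a small pre-install check. The sketch below only encodes what the text states (3.6 to 3.8 work directly, 3.9 needs a manual install of fast-mosestokenizer) and reports anything else as unknown. The function name and the status strings are hypothetical.

```python
import sys


def tokenizer_install_status(version=None):
    """Classify a Python version according to the ranges stated in the text."""
    major_minor = tuple((version or sys.version_info)[:2])
    if (3, 6) <= major_minor <= (3, 8):
        return "direct"   # should install directly on Linux
    if major_minor == (3, 9):
        return "manual"   # fast-mosestokenizer needs a manual install
    return "unknown"      # the text makes no statement about this version


print(tokenizer_install_status())
```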
#OPUS BITEXT AND MONOLINGUAL DATA DOWNLOAD#
OpusFilter is a tool for filtering and combining parallel corpora. It uses the OpusTools library (Aulamo et al., 2020) to download data from OPUS.