tl;dr

  • Combing through papers to track down the Irish-English parallel corpora they used was a real pain, so I built nlp.irish to document where to find those corpora and how to process them easily

What?

  • The intention behind nlp.irish is to make Irish NLP a little easier for newcomers by documenting the datasets that are available, where to find them, and how to load them into a pandas DataFrame.
  • The site is hosted on GitHub here, with the intention that it will grow through the collaborative effort of those working in Irish NLP.
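To give a flavour of the kind of loading code the site documents: many parallel corpora ship as a pair of sentence-aligned plain-text files, one Irish and one English, with one sentence per line. A minimal sketch of zipping such a pair into a pandas DataFrame (the file names and the `load_parallel` helper are hypothetical placeholders, not part of any specific corpus):

```python
from pathlib import Path

import pandas as pd


def load_parallel(ga_path: str, en_path: str) -> pd.DataFrame:
    """Zip two sentence-aligned text files into a two-column DataFrame."""
    ga = Path(ga_path).read_text(encoding="utf-8").splitlines()
    en = Path(en_path).read_text(encoding="utf-8").splitlines()
    # A parallel corpus must have the same number of lines on each side.
    assert len(ga) == len(en), "files must be sentence-aligned"
    return pd.DataFrame({"ga": ga, "en": en})


# Tiny demo with made-up data:
Path("demo.ga").write_text("Dia duit\nSlán\n", encoding="utf-8")
Path("demo.en").write_text("Hello\nGoodbye\n", encoding="utf-8")
df = load_parallel("demo.ga", "demo.en")
print(df.shape)  # (2, 2)
```

Real corpora need per-dataset tweaks (encoding quirks, blank lines, markup), which is exactly what the site's per-corpus instructions cover.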

Why?

  • Irish is a low-resource language and every piece of data out there is valuable.

Current Data

  • As of writing, five commonly used Irish-English parallel corpora have been documented, with instructions on where to find them and code showing how to process them:

    • ParaCrawl, v6
    • DGT-TM, DGT-Translation Memory
    • DCEP, Digital Corpus of the European Parliament
    • ELRC, European Language Resource Coordination
    • Tatoeba
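Some of these corpora are distributed as tab-separated files of sentence pairs rather than aligned file pairs. A hedged sketch of reading a Tatoeba-style two-column TSV with pandas (the sample data and file name are invented for illustration; the real Tatoeba export also carries extra ID columns):

```python
import pandas as pd

# Write a tiny made-up TSV so the example is self-contained.
sample = "Tá sé fuar\tIt is cold\nTá mé go maith\tI am well\n"
with open("tatoeba_sample.tsv", "w", encoding="utf-8") as f:
    f.write(sample)

pairs = pd.read_csv(
    "tatoeba_sample.tsv",
    sep="\t",             # tab-separated sentence pairs
    header=None,          # the raw file has no header row
    names=["ga", "en"],   # assign our own column names
)
print(len(pairs))  # 2
```

The same `read_csv` pattern adapts to the other corpora's formats with different separators and column layouts.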

Contributing

  • Contributing is as easy as submitting a pull request on GitHub. Alternatively, you can find me on Twitter at @mcgenergy and I can help update the site with your contribution.