
For a research project I need to:

  1. Read a .csv file
  2. Detect the language of each text from its title
  3. Identify the topic of the text from some keywords, e.g. lobotomy --> brain

I am trying to do the 2nd and 3rd points using Python and its NLTK library. Could you give me some tips if you have ever done something like this?

Thank you in advance!

senshi

1 Answer


It's not foolproof, but you can try several language identification tools.

Using langid.py

One of the most popular and easiest to use is langid.py: https://github.com/saffsd/langid.py

To install: python -m pip install -U langid

>>> import langid

>>> text = "Hallo, wie gehts?"
>>> lang, log_prob = langid.classify(text)
>>> print(lang)
de

Using pyCLD2

pycld2 is a wrapper around chromium-compact-language-detector; see https://github.com/aboSamoor/pycld2

Install: python -m pip install -U pycld2

>>> import pycld2 as cld2

>>> text = "Hallo, wie gehts?"

>>> isReliable, textBytesFound, details = cld2.detect(text)
>>> lang = details[0][1]
>>> print(lang)
de

Using cld3

Install: python -m pip install -U pycld3

>>> import cld3

>>> text = "Hallo, wie gehts?"

>>> prediction = cld3.get_language(text)
>>> print(prediction.language)
de
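Putting it together

For the three steps in the question (read the .csv, detect the language from the title, map keywords to a topic), here is a rough sketch rather than a definitive implementation. It assumes the file has a header row with a column named "title". The KEYWORD_TOPICS table and all function names are made up for illustration. The detector is passed in as a function, so langid.py can be swapped for pycld2 or cld3.

```python
import csv
import io

# Hypothetical keyword-to-topic table (an assumption); extend it for your data.
KEYWORD_TOPICS = {
    "lobotomy": "brain",
    "neuron": "brain",
    "aorta": "heart",
}


def detect_with_langid(text):
    """Return the language code predicted by langid.py (imported lazily)."""
    import langid

    lang, _score = langid.classify(text)
    return lang


def topics_for(title, keyword_topics=KEYWORD_TOPICS):
    """Return the sorted topics whose keywords occur as words in the title."""
    # Lowercase each whitespace-separated word and strip common punctuation.
    words = {w.strip(".,;:!?()\"'").lower() for w in title.split()}
    return sorted({topic for kw, topic in keyword_topics.items() if kw in words})


def annotate_titles(csv_file, detect=detect_with_langid, title_column="title"):
    """Yield (title, language, topics) for each row of an open CSV file."""
    for row in csv.DictReader(csv_file):
        title = row[title_column]
        yield title, detect(title), topics_for(title)
```

To run it on a real file, open it with open(path, newline="", encoding="utf-8") and loop over annotate_titles(f). Exact word matching is the simplest choice here; it will miss inflected forms such as "lobotomies", so stemming or lemmatizing the words first (for example with NLTK) may be worth trying.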

Here's a pretty nice recent summary (2019): https://arxiv.org/pdf/1910.06748.pdf

alvas