How to detect the language of a text (.csv) by its title using Python?

Question

For a research project I need to:

1. Read a .csv file
2. Detect the language of each text from its title
3. Identify the topic of the text from some keywords, e.g. lobotomy --> brain

I am trying to do the 2nd and 3rd points using Python and its NLTK library. Could you give me some tips if you have ever done something like this?

Thank you in advance!

For the second point I'd try that : https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language — Simon A, May 18 '20 at 14:48

alvas · Answer 1 · 2020-05-19T02:25:25.000

It's not foolproof, but you can try several language identification tools.

Using `langid.py`

One of the most popular and easiest to use is langid.py: https://github.com/saffsd/langid.py

To install: python -m pip install -U langid

>>> import langid

>>> text = "Hallo, wie gehts?"
>>> lang, log_prob = langid.classify(text)
>>> print(lang)
de
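
Since the question is about titles stored in a .csv file, the same call can be applied row by row. A minimal sketch, assuming the file has a header row with a column named `title`; the column name, the sample data, and the helper names `detect_title_languages` and `langid_detect` are made up for illustration. The detector is passed in as a function so that pycld2 or cld3 could be swapped in instead:

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, Tuple


def detect_title_languages(
    rows: Iterable[Dict[str, str]],
    detect: Callable[[str], str],
    title_column: str = "title",  # assumed column name
) -> Iterator[Tuple[str, str]]:
    """Yield (title, language_code) for every row that has a non-empty title."""
    for row in rows:
        title = (row.get(title_column) or "").strip()
        if title:
            yield title, detect(title)


def langid_detect(text: str) -> str:
    """Return the language code chosen by langid.py (imported lazily)."""
    import langid  # requires: python -m pip install -U langid

    lang, _score = langid.classify(text)
    return lang


# Hypothetical sample data; for a real file, pass
# csv.DictReader(open(path, newline="", encoding="utf-8")) instead.
SAMPLE_CSV = (
    "title,text\n"
    '"Hallo, wie gehts?",first text\n'
    '"Hello, how are you?",second text\n'
)
```

With langid installed, `list(detect_title_languages(csv.DictReader(io.StringIO(SAMPLE_CSV)), langid_detect))` gives one (title, language code) pair per row. Titles are short, so expect more mistakes than on full texts; if the candidate languages are known in advance, `langid.set_languages([...])` narrows the choice.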

Using `pyCLD2`

pycld2 is a wrapper around chromium-compact-language-detector (CLD2); see https://github.com/aboSamoor/pycld2

Install: python -m pip install -U pycld2

>>> import pycld2 as cld2

>>> text = "Hallo, wie gehts?"

>>> isReliable, textBytesFound, details = cld2.detect(text)
>>> lang = details[0][1]
>>> print(lang)
de

Using `cld3`

Install: python -m pip install -U pycld3

>>> import cld3

>>> text = "Hallo, wie gehts?"

>>> prediction = cld3.get_language(text)
>>> print(prediction.language)
de

Here's a pretty nice recent summary (2019) from https://arxiv.org/pdf/1910.06748.pdf
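
None of these tools covers the third point of the question (keyword --> topic, e.g. lobotomy --> brain). A minimal sketch of one way to approach it, assuming a hand-built lookup table; the table contents and the function name `topic_from_keywords` are hypothetical, and this is plain exact-word matching rather than anything NLTK-specific:

```python
import re
from typing import Dict, Optional

# Hypothetical keyword table; a real project would build its own.
KEYWORD_TO_TOPIC: Dict[str, str] = {
    "lobotomy": "brain",
    "neuron": "brain",
    "aorta": "heart",
}


def topic_from_keywords(
    text: str, table: Dict[str, str] = KEYWORD_TO_TOPIC
) -> Optional[str]:
    """Return the topic of the first word of the text found in the table, or None."""
    for word in re.findall(r"\w+", text.lower()):
        if word in table:
            return table[word]
    return None
```

Exact matching misses inflected forms such as "lobotomies"; stemming the words first (NLTK ships stemmers such as `PorterStemmer`) would be one way to loosen that, at the cost of stemming the table keys as well.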
