
Using Python, I want to identify French text in a list of short strings (from 1 to about 50 words) which are otherwise in English.

An example of the input data (input strings here are separated by commas):

year of the snake, legendary 'dragon horse', thunder, damsel-fly, larvae of mosquito, 
treillage, libellule, mythical water creature, petites chevrettes, de papillon hideux, 
the horse-fly, 5th earthly branch, dragon, mythical creature, 
a shore plant whose leaves dry a bright orange, dragon horse, god of rain, year of the dragon, 
orthopteran, crocodile, dont le duvet des ailes s'en va en poussière, insecte, dragonfly, 
dracontomelon vitiense, dragon king, petit filet pour une espèce de papillon, sorte d'insecte

Ideally I want to use a library that's already been built, as I'm aware that this is a difficult problem. However, the natural language library in Python I am most familiar with, nltk, does not seem to have the ability to do this, or if it does I haven't found it.

I'm aware that identifying a single word or two is likely to be very difficult, and I'd rather have false negatives (French misidentified as English) than false positives.

Rowan Jacobs
    There are datasets and NN models [here](https://towardsdatascience.com/deep-neural-network-language-identification-ae1c158f6a7d) and [here](https://medium.com/@amarbudhiraja/supervised-language-identification-for-short-and-long-texts-with-code-626f9c78c47c) to do so! – meti Aug 23 '21 at 07:39

1 Answer


There are various approaches to this problem. A more traditional and exact one (though prone to issues with out-of-vocabulary words) is to check each phrase against French and English word lists (a dictionary or lexicon) and see in which language the phrase matches fully, or matches more words.
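A minimal sketch of this word-list idea, using tiny hand-picked function-word sets purely for illustration (a real implementation would load full French and English word lists):

```python
# Illustrative only: these word sets are small hand-picked samples,
# not real lexicons.
FRENCH_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et",
                "pour", "dans", "dont", "sorte", "petit", "petites",
                "espèce", "va", "en"}
ENGLISH_WORDS = {"the", "of", "a", "an", "and", "for", "in", "year",
                 "dragon", "horse", "water", "creature", "king", "god"}

def guess_language(phrase):
    """Return 'fr' or 'en' by counting which word list matches more tokens.

    Ties (including phrases matching neither list) default to 'en',
    which biases toward false negatives as the question prefers.
    """
    tokens = [t.strip("'.,-").lower() for t in phrase.split()]
    fr = sum(t in FRENCH_WORDS for t in tokens)
    en = sum(t in ENGLISH_WORDS for t in tokens)
    return "fr" if fr > en else "en"

print(guess_language("petit filet pour une espèce de papillon"))  # fr
print(guess_language("year of the dragon"))                        # en
```

Because very short phrases ("libellule", "treillage") may match neither list, this approach naturally errs toward labelling them English, which matches the stated preference for false negatives.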

Another approach is to use a package built for language detection.

Yet another would be to use an ML language model to classify phrases (e.g. spaCy with a language-detection component such as spacy-langdetect).

sophros
  • Thank you! I chose to use langid since it had the best performance on the data I was looking at, but langdetect (also suggested by Jordi Carr on the nltk-users mailing list) and cld3 were also viable options for this task. – Rowan Jacobs Aug 27 '21 at 21:04