
Is there any package in R for identifying which language a text is written in? I have many rows containing text in different languages, such as "en", "es", "fr", "ja" and so on. Is it possible to get a result with a language column like below?

id text                 language
1  "I am a musician"    en 
2  "я инженер"          ru 
3  "Je suis un poète"   fr

Or is there any other possible way to determine the type of natural language?

gulnerman

1 Answer


Your best shot is probably cldr, which uses Chrome's language detection library.

library(devtools)
install_github("aykutfirat/cldr")  # installed from GitHub rather than CRAN

library(cldr)

docs1 <- c(
  "Detects the language of a set of documents with possible input hints. Returns the top 3 candidate languages and their probabilities as well.",
  "Som nevnt på møte forrige uke er det ulike ting som skjer denne og neste uke.",
  "Ganz besonders wollen wir, dass forthin allenthalben in unseren Städten, Märkten und auf dem Lande zu keinem Bier mehr Stücke als allein Gersten, Hopfen und Wasser verwendet und gebraucht werden sollen.",
  "Роман Гёте «Вильгельм Майстер» заложил основы воспитательного романа эпохи Просвещения.")

detectLanguage(docs1)$detectedLanguage
# [1] "ENGLISH" "NORWEGIAN" "GERMAN" "RUSSIAN"

However, your examples seem to be a bit too short.

docs2 <- c("I am a musician", "я инженер", "Je suis un poète")

detectLanguage(docs2)$detectedLanguage
# [1] "Unknown" "Unknown" "Unknown"

As noted by Ben, textcat seems to perform better on the shorter examples given by gulnerman, but unlike cldr it doesn't indicate how reliable the matches are. This makes it difficult to say how much you can trust the results, even though two out of three were correct in this case.

library(textcat)
textcat(docs2)
# [1] "latin" "russian-iso8859_5" "french"           
AkselA
  • Thanks for all the answers and comments; I am trying the two suggested ways, textcat and cldr. Could you also share which one would be better if anyone has experience with text that mixes words from different languages? – gulnerman Feb 12 '18 at 13:27
  • 1
  • Looking at the examples from the linked question, it seems `textcat` is better at handling short strings? – Ben Bolker Feb 12 '18 at 14:01
  • 1
  • My benchmark on the duplicate linked question shows that for short articles `textcat` is much slower. It also fails in some fairly obvious cases. https://stackoverflow.com/questions/8078604/detect-text-language-in-r/48790792#48790792 – moodymudskipper Feb 14 '18 at 15:49
  • Actually I couldn't install cldr (Bad credentials 401), so I have installed cld2, cld3 and textcat. cld2 works like this: `detect_language(docs2)` gives [1] "lb" "ru" "fr". Anyway, my dataset has >40k rows, but textcat doesn't detect Chinese and Japanese characters, labelling them as Sanskrit, Frisian and even German; cld2/cld3 detect the same Chinese and Japanese samples all as "ja". Plus cld2 detects ru correctly, but >21000 records are not detected. – gulnerman Feb 21 '18 at 12:51
  • `textcat` is super easy to install and use, but performs really badly on tweet text. The error rate is >40% in my sample; it is so bad I'm not even bothering with a precise number. Japanese text is repeatedly classified as German. The problem that @moodymudskipper mentioned persists. – Simone Apr 12 '22 at 07:49