Your best shot is probably cldr, which uses Chrome's language detection library.
library(devtools)
install_github("aykutfirat/cldr")
library(cldr)
docs1 <- c(
"Detects the language of a set of documents with possible input hints. Returns the top 3 candidate languages and their probabilities as well.",
"Som nevnt på møte forrige uke er det ulike ting som skjer denne og neste uke.",
"Ganz besonders wollen wir, dass forthin allenthalben in unseren Städten, Märkten und auf dem Lande zu keinem Bier mehr Stücke als allein Gersten, Hopfen und Wasser verwendet und gebraucht werden sollen.",
"Роман Гёте «Вильгельм Майстер» заложил основы воспитательного романа эпохи Просвещения.")
detectLanguage(docs1)$detectedLanguage
# [1] "ENGLISH" "NORWEGIAN" "GERMAN" "RUSSIAN"
However, your examples seem to be a bit too short.
docs2 <- c("I am a musician", "я инженер", "Je suis un poète")
detectLanguage(docs2)$detectedLanguage
# [1] "Unknown" "Unknown" "Unknown"
As noted by Ben, textcat seems to perform better on the shorter examples given by gulnerman, but unlike cldr it doesn't indicate how reliable the matches are. That makes it hard to say how much you can trust the results, even though two out of three were correct in this case.
library(textcat)
textcat(docs2)
# [1] "latin" "russian-iso8859_5" "french"
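If you want a sense of how much to trust a cldr match, look at the full data frame that detectLanguage() returns rather than only its detectedLanguage column; according to the package description quoted in docs1, it also reports the top three candidate languages and their probabilities. The exact column names may vary between versions, so this is just a sketch for inspecting the object:
res <- detectLanguage(docs2)
str(res)  # look for the candidate-language / probability columns before relying on a match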