Have a look at the textcat
package. It can be used to find the language of a text or text fragment.
It uses n-gram heuristics to determine the language, i.e. it makes an informed guess rather than a certain classification, so it will often be wrong; the error rate depends on the nature of your data, of course. You can help textcat out by excluding the languages that a text will probably NOT be written in.
You can set it up like this. For more details, read the documentation.
library(textcat)
library(Hmisc)   # provides %nin%

# keep only the language profiles that are plausible for this corpus
my.profiles <- ECIMCI_profiles[names(ECIMCI_profiles) %nin%
                                 c("afrikaans", "basque", "frisian",
                                   "middle_frisian", "latin", "rumantsch",
                                   "spanish", "welsh", "catalan", "hungarian",
                                   "romanian", "scots", "swedish")]
# ... process corpus as usual...
# then try to assign a language to each document.
myCorpusCopy <- tm_map(myCorpus, function(x) {
  # lang <- textcat::textcat(content(x))                  # variant: use the default profiles
  lang <- textcat::textcat(content(x), p = my.profiles)
  # warning(lang)                                          # uncomment to trace the detected language
  meta(x, tag = "language") <- lang
  x
})
# continue processing..
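To see what was actually assigned, you can inspect the per-document metadata afterwards (this assumes a standard VCorpus, so the corpus can be iterated like a list):
# language assigned to the first document
meta(myCorpusCopy[[1]], tag = "language")
# rough overview of the detected languages across the whole corpus
table(unlist(lapply(myCorpusCopy, meta, tag = "language")), useNA = "ifany")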
Update:
You said "I don't know the language of the text." In that case you need to detect the language first, before any language-specific processing can happen, and the code snippet above does exactly that in an automated manner.
Tokenization would be the next step. http://stanfordnlp.github.io/CoreNLP/ offers language models for Chinese, English, French, German, and Spanish.
In R, you can call these with
library(coreNLP)
initCoreNLP()
### lots of startup messages...
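As a minimal sketch of the next step, assuming the default English pipeline has been loaded by initCoreNLP(), tokenization with the coreNLP package looks roughly like this:
# annotate a sentence and pull out the token table (one row per token)
anno <- annotateString("Stanford CoreNLP splits this sentence into tokens.")
tok  <- getToken(anno)
head(tok$token)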
Writing robust R code with coreNLP and its add-on libraries is a nontrivial task, and I cannot help you much with it; it takes some time to get it right even for a single language.
To get started, read my answer from January 2016 (when I experimented with coreNLP):
https://stackoverflow.com/a/34852313/202553