
I have to tokenize a text into words, but I don't know the language of the text. It could be any language. So I have to build a tokenizer that detects the language of the text and then tokenizes it. If the tokenizer is not able to tokenize, I will return a flag like "not able to tokenize".

Please help me tokenize non-space languages (languages that don't separate words with spaces), if it's possible.

jay_phate
    https://cran.r-project.org/web/views/NaturalLanguageProcessing.html – G. Grothendieck May 06 '16 at 10:39

1 Answer


Have a look at the textcat package. It can be used to detect the language of a text or text fragment.

It uses character n-gram heuristics to make an informed guess, so it will sometimes be wrong; the error rate depends on the nature of your data, of course. You can help textcat by excluding the languages that your texts will probably NOT be written in.
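As a minimal sketch of what a call looks like (assuming textcat is installed; with no `p=` argument it uses its default language profiles):

```r
library(textcat)

# Guess the language of short fragments using the default profiles.
textcat("This is a short English sentence.")
# likely "english" -- but short fragments are exactly where textcat errs

textcat(c("Guten Morgen, wie geht es dir?",
          "Bonjour, comment allez-vous ?"))
# one guess per element, e.g. "german", "french"
```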

You can set it up like this. For more details, read the documentation.

library(textcat)
library(Hmisc)   # provides the %nin% ("not in") operator
my.profiles <- ECIMCI_profiles[names(ECIMCI_profiles) %nin% c("afrikaans",
                                       "basque",
                                       "frisian","middle_frisian",
                                       "latin",
                                       "rumantsch",
                                       "spanish",
                                       "welsh",
                                       "catalan",
                                       "hungarian",
                                       "romanian",
                                       "scots",
                                       "swedish")]

# ... process corpus as usual...
# then try to assign a language to each document.

library(tm)

myCorpusCopy <- tm_map(myCorpus, function(x) {
        # guess the language of this document and store it as metadata
        lang <- textcat::textcat(content(x), p = my.profiles)
        meta(x, tag = "language") <- lang
        x
})

# continue processing..
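To produce the "not able to tokenize" flag the question asks for, you could inspect the stored language tag afterwards. A sketch, assuming `myCorpusCopy` from the snippet above, and assuming (my choice, not textcat's) that an `NA` result or a language outside your supported set counts as a failure:

```r
# languages for which you actually have a tokenizer
supported <- c("english", "french", "german", "spanish", "chinese")

langs <- sapply(myCorpusCopy, function(x) meta(x, tag = "language"))
flag  <- ifelse(is.na(langs) | !(langs %in% supported),
                "not able to tokenize", langs)
```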

Update:

You said "I don't know the language of the text." I assumed you first need to classify each text in order to predict the language it is written in. The code snippet above does this in an automated manner.

Tokenization would be the next step. http://stanfordnlp.github.io/CoreNLP/ offers language models for Chinese, English, French, German, and Spanish. In R, you can call these with the coreNLP package:

library(coreNLP)
initCoreNLP()
### lots of startup messages...

Writing robust R code using the coreNLP package and its add-on libraries is a nontrivial task where I cannot help you much. It takes some time to get it right even for a single language. To get started, read my answer from January 2016 (when I experimented with coreNLP): https://stackoverflow.com/a/34852313/202553
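A minimal tokenization sketch with coreNLP (assuming the Java models have been fetched once with `downloadCoreNLP()`; `getToken()` returns a data frame with one row per token):

```r
library(coreNLP)
# downloadCoreNLP()   # one-time download of the Java model files
initCoreNLP()

anno   <- annotateString("Stanford CoreNLP tokenizes text into words.")
tokens <- getToken(anno)
tokens$token
# a character vector of the tokens, e.g. "Stanford", "CoreNLP", ...
```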

knb