How to use tm package for text analytics in Hebrew or Arabic

Question

I would like to use the tm package for Hebrew or Arabic text analysis. I tried several approaches to see whether tm can process some words, but I ran into an error. Is there a way to solve this issue?

 text  <- "הנוסעים חיכו זמן רב לנסיעה"
 Encoding(text)
#[1] "unknown"
 Encoding(text)  <- "UTF-8"
 ap.corpus <- Corpus(DataframeSource(data.frame(text)))
 ap.corpus <- tm_map(ap.corpus, removePunctuation)
 ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
Error in FUN(content(x), ...) : 
  invalid input 'הנוסעים חיכו זמן רב לנסיעה' in 'utf8towcs'

Hoju · Answer 1 · 2017-07-06T19:31:56.067

From the tm vignette:

The second argument readerControl of the corpus constructor has to be a list with the named components reader and language. (...) Finally, the second component language sets the texts’ language (preferably using ISO 639-2 codes).

From Wikipedia, the ISO 639-2 code for Arabic is ara and for Hebrew heb. So maybe try this:

 ap.corpus <- Corpus(DataframeSource(data.frame(text), readerControl = list(language = "heb")))

Edit: Glad you found the answer. When the wrong encoding is used this error comes up:

Hi @Hoju, I got: Error in DataframeSource(data.frame(text), readerControl = list(language = "heb")) : unused argument (readerControl = list(language = "heb")) — mql4beginner, Jul 06 '17 at 18:04

score 1 · Accepted Answer · answered Jul 06 '17 at 19:02

Here is the answer: we need to convert the text with iconv:

iconv(text, "ISO-8859-8", "UTF-8")[1]

instead of using Encoding(text) <- "UTF-8".
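The difference between the two calls can be sketched as follows. Encoding<- only changes the declared encoding mark and leaves the bytes as they are, while iconv actually converts them. The sample string, the variable names, and the step that simulates ISO-8859-8 input are assumptions made for illustration; the tm part uses VectorSource rather than the DataframeSource from the question and only runs if tm is installed.

```r
# Hebrew sample ("shalom, olam") written with \u escapes so the script
# does not depend on the encoding of the source file.
utf8_text <- "\u05e9\u05dc\u05d5\u05dd, \u05e2\u05d5\u05dc\u05dd"

# Assumption: simulate input that arrived as ISO-8859-8 bytes.
legacy <- iconv(utf8_text, from = "UTF-8", to = "ISO-8859-8")

# Relabelling alone: the bytes are still ISO-8859-8, so they are not valid UTF-8.
mislabeled <- legacy
Encoding(mislabeled) <- "UTF-8"
print(validUTF8(mislabeled))

# Real conversion: the bytes are rewritten as UTF-8.
fixed <- iconv(legacy, from = "ISO-8859-8", to = "UTF-8")
print(validUTF8(fixed))

# Optional tm pipeline on the converted text.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # readerControl belongs to the corpus constructor, not to the source.
  corpus <- VCorpus(VectorSource(fixed),
                    readerControl = list(language = "heb"))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  print(content(corpus[[1]]))
}
```

If the input is already valid UTF-8, the iconv step should be skipped; the "ISO-8859-8" source encoding is specific to how the text was read in the question.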

answered Jul 06 '17 at 19:02

mql4beginner
