1

I would like to use the tm package for Hebrew or Arabic text analysis. I tried several methods to see whether tm can process some words, but I ran into an error. Is there a way to solve this issue?

 library(tm)

 text <- "הנוסעים חיכו זמן רב לנסיעה"
 Encoding(text)
 #[1] "unknown"
 Encoding(text) <- "UTF-8"
 ap.corpus <- Corpus(DataframeSource(data.frame(text)))
 ap.corpus <- tm_map(ap.corpus, removePunctuation)
 ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
 # Error in FUN(content(x), ...) :
 #   invalid input 'הנוסעים חיכו זמן רב לנסיעה' in 'utf8towcs'
mql4beginner

2 Answers

1

From the tm vignette:

The second argument readerControl of the corpus constructor has to be a list with the named components reader and language. (...) Finally, the second component language sets the texts’ language (preferably using ISO 639-2 codes).

According to Wikipedia, the ISO 639-2 codes are ara for Arabic and heb for Hebrew, so maybe try this:

 ap.corpus <- Corpus(DataframeSource(data.frame(text), readerControl = list(language = "heb")))

Edit: Glad you found the answer. When the wrong encoding is used this error comes up:

Hoju
  • Hi @Hoju, I got: Error in DataframeSource(data.frame(text), readerControl = list(language = "heb")) : unused argument (readerControl = list(language = "heb")) – mql4beginner Jul 06 '17 at 18:04
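The "unused argument" error in the comment above is consistent with readerControl having been passed to DataframeSource() instead of to the corpus constructor, which is where the vignette quote places it. A minimal sketch of that placement follows, assuming tm is installed and using VCorpus with VectorSource rather than the DataframeSource from the question. The sample string is illustrative and not taken from the question. As far as I can tell, this only records the language in the document metadata and would not by itself repair a wrongly declared encoding.

```r
library(tm)

# Illustrative Hebrew text written with \u escapes, so the string is valid
# UTF-8 regardless of how this file itself is encoded.
sample_text <- "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"

# readerControl belongs to the corpus constructor, not to the source object.
ap.corpus <- VCorpus(VectorSource(sample_text),
                     readerControl = list(language = "heb"))

# The language code is stored in each document's metadata.
doc_language <- meta(ap.corpus[[1]], "language")
```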
1

Here is the answer: the text needs to be converted to UTF-8 with iconv:

iconv(text, "ISO-8859-8", "UTF-8")[1]

instead of only being marked with Encoding(text) <- "UTF-8".
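A likely reason for the difference: Encoding<- only changes the declared encoding of a string and leaves its bytes alone, while iconv() actually re-encodes the bytes. The base R sketch below illustrates this under the assumption that the input bytes really are ISO-8859-8 and that the platform's iconv supports that name. The four bytes are a made-up sample (the Hebrew word "shalom"), not the text from the question.

```r
# Four bytes that spell a Hebrew word in ISO-8859-8 (assumed sample input).
legacy <- rawToChar(as.raw(c(0xF9, 0xEC, 0xE5, 0xED)))

# Declaring the string as UTF-8 does not touch the bytes, so the result
# is not valid UTF-8.
marked <- legacy
Encoding(marked) <- "UTF-8"
marked_is_valid <- validUTF8(marked)

# iconv() converts the bytes themselves into UTF-8.
converted <- iconv(legacy, from = "ISO-8859-8", to = "UTF-8")
converted_is_valid <- validUTF8(converted)

# Unicode code points of the converted string (U+05E9 U+05DC U+05D5 U+05DD).
code_points <- utf8ToInt(converted)
```

Here marked_is_valid should come out FALSE and converted_is_valid TRUE, which matches the 'utf8towcs' error appearing only when the string was merely marked as UTF-8.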

mql4beginner