I am using the tm and wordcloud packages for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).
Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:
Special
satisfação
Happy
Sad
Potential für
I then read my txt file into R:
words <- Corpus(DirSource("~/temp", encoding = "UTF-8"), readerControl = list(language = "lat"))
This yields the warning message:
Warning message:
In readLines(y, encoding = x$Encoding) :
incomplete final line found on '/temp/file.txt'
But since it's a warning, not an error, I continue to push forward.
words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)
This then yields the error:
Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'
I'm open to filtering out the non-English characters in either TextWrangler or R, whichever is most expedient. Thanks for your help!
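For reference, here is the kind of filtering I have in mind on the R side, as a rough sketch only. It assumes that "non-English" can be approximated as "contains a non-ASCII character", which may be too blunt. The `lines` vector is a made-up stand-in for my file, and the exact handling of unconvertible characters by base R's `iconv()` may vary by platform.

```r
# Hypothetical sample mirroring the lines shown above; in practice this would
# come from something like readLines("~/temp/file.txt", encoding = "UTF-8").
lines <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad", "Potential f\u00fcr")

# With the default sub = NA, iconv() returns NA for any element that cannot be
# represented in the target encoding, so NA flags entries with non-ASCII text.
ascii <- iconv(lines, from = "UTF-8", to = "ASCII")

# Option 1: drop every entry that contains at least one non-ASCII character.
english_only <- lines[!is.na(ascii)]

# Option 2: keep every entry, but delete the bytes that cannot be converted.
stripped <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
```

Option 1 throws away the whole entry (so "Potential für" would go entirely), while option 2 keeps the entry and only removes the accented characters. I'm not sure which is preferable before handing the text to tm_map().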