This is a common issue with the tm package (1, 2, 3).
One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into R (or use gsub in R). For example, you'd search and replace all instances of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but if you have thousands of individual text files, a manual find-and-replace is obviously impractical.
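Here is a minimal sketch of the gsub route in base R. The example string and the replacement table (`from`, `to`) are my own illustrative assumptions, so extend them to whatever characters actually occur in your data:

```r
# Example string; "\u00D6" is the O-umlaut, written as an escape so the
# script itself stays plain ASCII.
x <- "Erneut riesiger (Alt-)\u00D6l-Teppich im Golf von Mexiko"

# Illustrative replacement table (an assumption, not a complete list).
from <- c("\u00D6", "\u00F6", "\u00C4", "\u00E4", "\u00DC", "\u00FC", "\u00DF")
to   <- c("Oe",     "oe",     "Ae",     "ae",     "Ue",     "ue",     "ss")

# Replace each character in turn; fixed = TRUE treats the pattern literally.
clean <- x
for (i in seq_along(from)) {
  clean <- gsub(from[i], to[i], clean, fixed = TRUE)
}
print(clean)
```

The same loop could be wrapped in a function and applied to each file's contents, which sidesteps the manual editing but still changes the text itself.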
For an R solution, I found that using VectorSource instead of DirSource seems to solve the problem:
# I put your example text in a file and tested it with both ANSI and
# UTF-8 encodings; both reproduced your problem
#
library(tm)
tmp <- Corpus(DirSource('C:\\...\\tmp/'))
tmp <- tm_map(tmp, tolower)
Error in FUN(X[[1L]], ...) :
invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
# a very similar error to the one you got, with both ANSI and UTF-8 encodings
#
# Now try VectorSource instead of DirSource
tmp <- readLines('C:\\...\\tmp.txt')
tmp
[1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
# looks ok so far
tmp <- Corpus(VectorSource(tmp))
tmp <- tm_map(tmp, tolower)
tmp[[1]]
rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
# seems to have worked just fine. It worked best with ANSI encoding.
# There was no error with UTF-8 encoding, but the Ö was returned
# as ã–, which is not good
But this seems like a bit of a lucky coincidence. There must be a more direct way about it. Do let us know what works for you!
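One candidate for that more direct route is to convert the text to UTF-8 explicitly before building the corpus. This is only a sketch under assumptions I have not verified against your data: that the files are Latin-1 (for Ö, the Windows "ANSI" code page and Latin-1 use the same byte), and that a temporary file I write here stands in for one of your tweets.

```r
# Hypothetical stand-in file: "(Alt-)Öl-Teppich" stored as Latin-1 bytes,
# where 0xD6 is the single byte for the O-umlaut.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(c(charToRaw("(Alt-)"), as.raw(0xD6), charToRaw("l-Teppich\n")), con)
close(con)

# readLines returns the bytes as they are; nothing is converted yet.
raw_lines <- readLines(path, warn = FALSE)

# Convert explicitly from the assumed source encoding to UTF-8.
utf8_lines <- iconv(raw_lines, from = "latin1", to = "UTF-8")

print(Encoding(utf8_lines))
print(validUTF8(utf8_lines))

# The converted lines could then go into the corpus as before, e.g.
# tmp <- Corpus(VectorSource(utf8_lines))
```

If the conversion is the missing piece, the text keeps its umlauts instead of having them replaced, but I'd treat that as something to test rather than a confirmed fix.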