R tm package invalid input in 'utf8towcs'

Question

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'

The problem is some characters are not valid. I'd like to exclude the invalid characters from analysis either from within R or before importing the files for processing.

I tried using iconv to convert all files to utf-8 and exclude anything that can't be converted to that as follows:

find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;

as pointed out in "Batch convert latin-1 files to utf-8 using iconv".

But I still get the same error.

I'd appreciate any help.

score 72 · Answer 1 · edited Apr 01 '14 at 19:29

None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (see http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).

The code is this simple:

library(stringr)  # provides str_replace_all
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")

edited Apr 01 '14 at 19:29 by SztupY

answered Apr 01 '14 at 19:12 by David

This should be marked as the solution. It works and it's been popular for years, but the OP didn't stick around to mark it as being correct. – Hack-R Jul 15 '17 at 21:15

as an alternative using base r, you can try: `usableText <- iconv(tweets$text, "ASCII", "UTF-8", sub="")` – Agile Bean Mar 19 '18 at 14:36

score 25 · Answer 2 · edited Jul 24 '12 at 14:54

This is from the tm FAQ:

It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.

I hope this helps; for me it does.

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

http://tm.r-forge.r-project.org/faq.html
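
To see what sub = "byte" does outside tm, here is a minimal base-R sketch. The sample strings are made up for illustration, and validUTF8() assumes R 3.3.0 or later.

```r
# Two illustrative strings: the second contains a lone 0xD6 byte
# (Latin-1 "Ö"), which is not valid UTF-8 on its own.
x <- c("plain ascii", "bad \xd6l byte")

# Base R can report which elements are valid UTF-8.
valid <- validUTF8(x)

# Replace each unconvertible byte with its hex code, e.g. "<d6>" ...
hex_marked <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

# ... or drop unconvertible bytes entirely.
dropped <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

print(valid)
print(hex_marked)
print(dropped)
```

In tm versions that provide content_transformer(), the same call would probably need to be wrapped, along the lines of tm_map(yourCorpus, content_transformer(function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte"))). Treat that as an assumption to check against your installed version.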

edited Jul 24 '12 at 14:54 by DisplayName

answered Jul 24 '12 at 14:45 by user1374611

score 14 · Answer 3 · answered Sep 08 '15 at 09:11

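I think it is clear by now that the problem is caused by emojis, which tolower cannot handle.

# Remove emojis: drop any characters that cannot be represented in ASCII.
# sub = '' is needed; without it, iconv returns NA for strings it cannot convert.
dataSet <- iconv(dataSet, 'UTF-8', 'ASCII', sub = '')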
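
As a rough illustration of how iconv behaves when converting to ASCII, the base-R sketch below compares the call with and without the sub argument. The sample strings are invented; an accented letter stands in for an emoji because it behaves the same way here.

```r
# "caf\u00e9" is "café"; the é has no ASCII representation.
x <- c("caf\u00e9", "plain")

# Without sub, any string containing an unconvertible character becomes NA.
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# With sub = "", only the unconvertible characters are dropped.
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(strict)
print(dropped)
```

Note that this discards every non-ASCII character, including legitimate letters such as umlauts, so it is a blunt tool for non-English text.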

answered Sep 08 '15 at 09:11 by Saurabh Yadav

Kenton · Answer 4 · 2013-01-18T16:10:26.477

I have just run afoul of this problem. By chance, are you using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

What I was seeing is that using the solution from the FAQ

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform

This I traced to the enc2utf8 function. Bad news is that this is a problem with my underlying OS and not R.

So here is what I did as a workaround:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

This forces iconv to use the UTF-8 encoding on the Mac and works fine without the need to recompile.

score 8 · Answer 5 · answered Apr 19 '18 at 22:20

I have often run into this issue and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (like converting it’s to itâ€™s).

I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
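
A minimal sketch of that replacement, assuming the stringi package is installed; the sample string is adapted from the tweet in the question.

```r
library(stringi)  # assumes the stringi package is installed

# A string with a non-ASCII capital letter (Ö), written with a Unicode escape.
x <- "\u00d6L-TEPPICH im Golf von Mexiko"

# stri_trans_tolower() is stringi's Unicode-aware counterpart of tolower().
lowered <- stri_trans_tolower(x)
print(lowered)
```

For a tm corpus, something like tm_map(dataSet, content_transformer(stri_trans_tolower)) would be the likely equivalent, though that wrapping is an assumption and depends on the tm version in use.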

score 4 · Answer 6 · edited Feb 11 '14 at 13:00

I have been running this on a Mac and, to my frustration, I had to identify the offending record (these were tweets) to resolve it. Since there is no guarantee that the offending record will be the same next time, I used the following function

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

as suggested above.

It worked like a charm

edited Feb 11 '14 at 13:00 by PKumar

answered Aug 08 '13 at 11:59 by Krishna Vedula

score 2 · Answer 7 · answered Nov 23 '16 at 16:07

The previous suggestions didn't work for me. I investigated further and found one that worked at https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/

# Create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Apply it for substituting the regular expression given in one of the former answers by " "
your_corpus<- tm_map(your_corpus,toSpace,"[^[:graph:]]")

# the tolower transformation worked!
your_corpus <- tm_map(your_corpus, content_transformer(tolower))

score 2 · Answer 8 · answered Mar 10 '12 at 07:12

This is a common issue with the tm package (1, 2, 3).

One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into R (or use gsub in R). For example, you'd search and replace all instances of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but if you have thousands of individual text files, obviously this is no good.

For an R solution, I found that using VectorSource instead of DirSource seems to solve the problem:

# I put your example text in a file and tested it with both ANSI and 
# UTF-8 encodings, both enabled me to reproduce your problem
#
tmp <- Corpus(DirSource('C:\\...\\tmp/'))
tmp <- tm_map(tmp, tolower)
Error in FUN(X[[1L]], ...) : 
  invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
# quite similar error to what you got, both from ANSI and UTF-8 encodings
#
# Now try VectorSource instead of DirSource
tmp <- readLines('C:\\...\\tmp.txt') 
tmp
[1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
# looks ok so far
tmp <- Corpus(VectorSource(tmp))
tmp <- tm_map(tmp, tolower)
tmp[[1]]
rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
# seems like it's worked just fine. It worked best for ANSI encoding. 
# There was no error with UTF-8 encoding, but the Ö was returned 
# as ã– which is not good

But this seems like a bit of a lucky coincidence. There must be a more direct way about it. Do let us know what works for you!

Thanks for your reply Ben! For some reason, that same line of code that failed for me works now. I don't know if this is another lucky coincidence :) I didn't change anything, just rerun it and this time it works without any hiccups. — maiaini, Mar 15 '12 at 10:46

score 1 · Answer 9 · edited Sep 05 '12 at 01:18

Use the following steps:

# First, save your documents as .txt files with UTF-8 encoding
library(tm)
# Set your directory, for example "F:/tmp"
# "/tmp" is your directory; in place of "english" you can use any language allowed by R
dataSet <- Corpus(DirSource("/tmp"), readerControl = list(language = "english"))
dataSet <- tm_map(dataSet, tolower)

inspect(dataSet)

edited Sep 05 '12 at 01:18 by jonsca

answered Aug 23 '12 at 06:04 by Ashutosh Agrahari

score 1 · Answer 10 · answered May 06 '13 at 10:21

The official FAQ solution did not work in my situation:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

I finally got it working with a for loop and the Encoding function:

# Mark every document in the corpus as UTF-8 before lowercasing
for (i in seq_along(dataSet))
{
  Encoding(dataSet[[i]]) <- "UTF-8"
}
corpus <- tm_map(dataSet, tolower)
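
It may help to know that Encoding<- only changes the declared encoding of a string and leaves its bytes untouched. The base-R sketch below illustrates this with a made-up string whose bytes are already valid UTF-8.

```r
# The bytes C3 96 6C spell "Öl" in UTF-8; \x escapes insert them verbatim.
x <- "\xc3\x96l"
bytes_before <- charToRaw(x)

# Declare the string to be UTF-8. This changes only the encoding mark.
Encoding(x) <- "UTF-8"
bytes_after <- charToRaw(x)

print(Encoding(x))
print(identical(bytes_before, bytes_after))
```

So this approach can only help when the text really is UTF-8 but was not marked as such. If the bytes are in another encoding, or are simply invalid, a conversion with iconv is still needed.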

score 1 · Answer 11 · edited Oct 31 '12 at 15:56

If it's alright to ignore invalid inputs, you could use R's error handling. For example:

  dataSet <- Corpus(DirSource('tmp/'))
  dataSet <- tm_map(dataSet, function(data) {
     #ERROR HANDLING
     possibleError <- tryCatch(
         tolower(data),
         error=function(e) e
     )

     # if(!inherits(possibleError, "error")){
     #   REAL WORK. Could do more work on your data here,
     #   because you know the input is valid.
     #   useful(data); fun(data); good(data);
     # }
  })

There is an additional example here: http://gastonsanchez.wordpress.com/2012/05/29/catching-errors-when-using-tolower/
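
A compact base-R variant of the same idea, as a sketch only: safe_tolower is a hypothetical helper name, and returning NA for failing elements is one possible choice among several.

```r
# Hypothetical helper: lowercase each element separately and return NA
# for any element on which tolower() raises an error.
safe_tolower <- function(x) {
  vapply(
    x,
    function(s) tryCatch(tolower(s), error = function(e) NA_character_),
    character(1),
    USE.NAMES = FALSE
  )
}

# The second string contains a byte that is invalid in UTF-8. Whether
# tolower() fails on it depends on the locale R is running in.
result <- safe_tolower(c("ABC", "bad \xd6l byte"))
print(is.na(result))
```

Elements that come back as NA can then be dropped or inspected separately, which also identifies the offending records.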

score 0 · Answer 12 · answered Aug 27 '15 at 08:40

Chad's solution wasn't working for me. I had this embedded in a function and it was giving an error about iconv needing a vector as input. So, I decided to do the conversion before creating the corpus.

myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))

score 0 · Answer 13 · answered Feb 05 '18 at 23:32

I was able to fix it by converting the data back to plain text format using this line of code

corpus <- tm_map(corpus, PlainTextDocument)

thanks to user https://stackoverflow.com/users/4386239/paul-gowder

for his response here

https://stackoverflow.com/a/29529990/815677

score 0 · Answer 14 · answered Jun 06 '20 at 10:11

I had the same problem on my Mac and solved it with the following:

raw_data <- read.csv(file.choose(), stringsAsFactors = F,  encoding="UTF-8")

raw_data$textCol<- iconv(raw_data$textCol, "ASCII", "UTF-8", sub="byte")

data_corpus <- VCorpus(VectorSource(raw_data$textCol))

corpus_clean = tm_map(data_corpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

corpus_clean <- tm_map(data_corpus, content_transformer(tolower))
