5

I am trying to create a word cloud from Twitter data, but I get the following error:

Error in FUN(X[[72L]], ...) : 
  invalid input '������������❤������������ "@xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs' 

This error appears after running the "mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, tolower)" code

# Extract the text of each tweet
mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())

# Build the corpus from the extracted text, then clean it
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))

I read on other pages that this may be due to R having difficulty processing symbols, emoticons and letters from non-English languages, but that does not appear to be the problem with the tweets that trigger the error. I ran the following:

mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, content_transformer(function(x)    iconv(enc2utf8(x), sub = "bytes")))

These do not help. I also get an error saying that the function content_transformer cannot be found, even though the tm package is checked (loaded) in RStudio.

I'm running this on OS X 10.6.8 and using the latest RStudio.

Anne Boysen
  • Maybe you wanna try the tolower function from the stringi package: `tm_map(mytwittersearch_corpus, content_transformer(stringi::stri_trans_tolower))`. – lukeA Jan 03 '15 at 16:30
  • 1
    `content_transformer` is relatively new. You may need to update the package. What is `packageVersion("tm")`? – Rich Scriven Jan 03 '15 at 16:57
  • As Richard says, it is probably more important to post the version of R and of the packages that are loaded. The `sessionInfo()` function is the easiest way to gather and present that information. – IRTFM Jan 03 '15 at 17:35
  • Thank you for your help, Richard! I tried to run the code, but unfortunately I get the same message. I will try to update the tm package though. Could the absence of content_transformer explain the error, maybe? – Anne Boysen Jan 03 '15 at 17:36
  • See if you get the same sort of error with a reduced version of `mytwittersearch`, perhaps `small <- head(mytwittersearch)`. If so, then you should post the output of `dput(small)` – IRTFM Jan 03 '15 at 17:38
  • Thank you BonedDust. the dput gives:structure(list(structure("#Budget cuts and #veterans preference may be keeping #Millennials out of the federal workforce http://t.co/sU7DCLm4H2 @WashingtonPost", Author = character(0), DateTimeStamp = structure(list( sec = 7.71148109436035, min = 34L, hour = 17L, mday = 3L, mon = 0L, year = 115L, wday = 6L, yday = 2L, isdst = 0L), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst" ), ............ – Anne Boysen Jan 03 '15 at 17:51
  • Exact duplicate of [Error converting text to lowercase with tm\_map(..., tolower)](http://stackoverflow.com/questions/13640188/error-converting-text-to-lowercase-with-tm-map-tolower) – smci Jul 21 '16 at 20:28

8 Answers

10

I use this code to get rid of the problem characters:

tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
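As a minimal sketch of what this call does, here it is applied to a small made-up vector standing in for `tweets$text` (the sample strings and the names `tweet_text`, `clean_text` and `lower_text` are hypothetical). Since `iconv` is vectorized, the `sapply` wrapper is left out here. Every byte that has no ASCII equivalent is replaced by `sub = ""`, so emoji and accented letters should disappear alike.

```r
# Hypothetical sample standing in for tweets$text
tweet_text <- c("I love R \u2764", "caf\u00e9 au lait", "plain ascii")

# Treat the bytes as latin1 and convert to ASCII; bytes that cannot be
# represented in ASCII are replaced by sub = "" (i.e. dropped)
clean_text <- iconv(tweet_text, from = "latin1", to = "ASCII", sub = "")

# With only ASCII left, tolower() has no multibyte input to choke on
lower_text <- tolower(clean_text)
print(lower_text)
```

Keep in mind that this is lossy: legitimate accented characters are stripped along with the emoji, which may matter for non-English tweets.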
Nizam
RUser
  • Unfortunately it didn't work for me, but I wonder, does it get rid of emoji too or only alphabetic characters? – Anne Boysen Jan 05 '15 at 04:26
  • It removes all the emoji for me. – RUser Jan 05 '15 at 07:47
  • this solution works but only for latin characters. in case of frequent utf-8 conversion problems, the simplest code for conversion seems to be: `tweets$text <- iconv(tweets$text, "ASCII", "UTF-8", sub="byte")` – Agile Bean Mar 19 '18 at 15:08
2

A nice example of creating a word cloud from Twitter data is here. Using that example and the code below, and passing tolower = TRUE in the control list when creating the TermDocumentMatrix, I could create a Twitter word cloud.

library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)


#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")

#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())

#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))

#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,control = list(removePunctuation = TRUE,stopwords = c("new", "year", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))

#Convert as matrix
m = as.matrix(tdm)

#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE) 

#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)

#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))


  • Thanks Deb, I tried to run it, but get the same 'tolower' problem: Error in tolower(txt) : invalid input 'My New Year hope: : : : To get ✨@Harry_Styles follow✨ It's my dream since 2013.. I love him from bottom of my heart❤️ please������������������' in 'utf8towcs'" Wish I could find a way to make it work, but there seems to be something wrong with the "tm" package. – Anne Boysen Jan 04 '15 at 05:25
  • I could successfully run the above code and get the wordcloud in Windows XP OS and RStudio. –  Jan 04 '15 at 06:59
2

Have you tried updating tm and using stri_trans_tolower from stringi?

library(twitteR)
library(tm)
library(stringi)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
mytwittersearch <- showStatus(551365749550227456) 
mytwittersearch_list <- mytwittersearch$getText()
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))

mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(tolower))
# Error in FUN(content(x), ...) : 
#   invalid input 'í ½í±…í ¼í¾¯â¤í ¼í¾§í ¼í½œ "@comScore: Nearly half of #Millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56Fb78aTSC"' in 'utf8towcs'

mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(stri_trans_tolower))
inspect(mytwittersearch_corpus)
# <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
#   
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# <ed><U+00A0><U+00BD><ed><U+00B1><U+0085><ed><U+00A0><U+00BC><ed><U+00BE><U+00AF><U+2764><ed><U+00A0><U+00BC><ed><U+00BE><U+00A7><ed><U+00A0><U+00BC><ed><U+00BD><U+009C> "@comscore: nearly half of #millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56fb78atsc"
lukeA
  • Stringi didn't help either :( It's strange because I still get the response: Error in match.fun(FUN) : could not find function "content_transformer". Could the content_transformer be found in any alternative package? – Anne Boysen Jan 05 '15 at 04:24
  • May I ask, is content_trasformer listed as part of your tm=package? When I search for it under my tm-package I get "no results found". I thought this was strange. It would be very interesting to compare if this comes up for others. Then I know it's something wrong with my tm package. – Anne Boysen Jan 05 '15 at 04:42
  • @AnneBoysen http://cran.r-project.org/web/packages/tm/tm.pdf#page=4 - yes it's part of `tm`. – lukeA Jan 05 '15 at 06:19
  • Now, with tm package version 0.5-10, you can simply use : mytwittersearch_corpus <- tolower(mytwittersearch_corpus) . This should work fine. – Rupesh Kumar Sep 16 '18 at 13:49
2

The solutions above may have worked once, but they no longer work with the newest versions of wordcloud and tm.

This problem almost drove me crazy, but I found a solution and want to explain it as well as I can to spare others the same frustration.

The function which is implicitly called by wordcloud and responsible for throwing the error

 Error in FUN(content(x), ...) : in 'utf8towcs'

is this one:

words.corpus <- tm_map(words.corpus, tolower)

which is a shortcut for

words.corpus <- tm_map(words.corpus, content_transformer(tolower))

To provide a reproducible example, here's a function that embeds the solution:

plot_wordcloud <- function(words, max_words = 70, remove_words = "",
                           n_colors = 5, palette = "Set1")
{
    require(dplyr)        # for %>%
    require(wordcloud)
    require(RColorBrewer) # for brewer.pal()
    require(tm)           # for Corpus() and VectorSource()

    # Solution: treat the text as ASCII and replace every non-ASCII byte
    # with its hex code (e.g. <e2>), so that tolower() no longer fails
    words <- iconv(words, "ASCII", "UTF-8", sub = "byte")

    # Build the corpus from the converted text
    words.corpus <- Corpus(VectorSource(words))

    wc <- wordcloud(words = words.corpus, max.words = max_words,
                    random.order = FALSE,
                    colors = brewer.pal(n_colors, palette),
                    random.color = FALSE,
                    scale = c(5.5, .5), rot.per = 0.35) %>% recordPlot
    return(wc)
}

Here's what failed:

I tried to convert the text BEFORE and AFTER creating the corpus with

words.corpus <- Corpus(VectorSource(words))

BEFORE:

Converting to UTF-8 on the text didn't work with:

words <- sapply(words, function(x) iconv(enc2utf8(x), sub = "byte"))

nor

for (i in 1:length(words))
{
    Encoding(words[[i]])="UTF-8"
}

AFTER:

Converting to UTF-8 on the corpus didn't work with:

    words.corpus <- tm_map(words.corpus, removeWords, remove_words)

nor

    words.corpus <- tm_map(words.corpus, content_transformer(stringi::stri_trans_tolower))

nor

    words.corpus <- tm_map(words.corpus, function(x) iconv(x, to='UTF-8'))

nor

    words.corpus <- tm_map(words.corpus, enc2utf8)

nor

    words.corpus <- tm_map(words.corpus, tolower)

All of these solutions may have worked at some point, so I don't want to discredit their authors, and they may work again in the future. It is hard to say why they failed for me, since there were good reasons to expect them to work. In any case, remember to convert the text before creating the corpus with:

    words <- iconv(words, "ASCII", "UTF-8", sub="byte")

Disclaimer: I got the solution with more detailed explanation here: http://www.textasdata.com/2015/02/encoding-headaches-emoticons-and-rs-handling-of-utf-816/
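To make it clearer what that conversion does, here is a small sketch on a made-up vector (the sample strings and the names `words_ascii` and `lower_words` are hypothetical). With `sub = "byte"`, `iconv` does not delete anything: as documented for base R, each byte that is invalid in the source encoding is written out as its hex code in angle brackets, so the result should contain only ASCII characters.

```r
# Hypothetical sample text containing an emoji and an accented letter
words <- c("Happy New Year \u2764", "sant\u00e9")

# Declare the input as ASCII; every byte above 0x7F is invalid ASCII and
# is replaced by "<xx>", where xx is the hex code of that byte
words_ascii <- iconv(words, from = "ASCII", to = "UTF-8", sub = "byte")
print(words_ascii)

# Only ASCII is left, so lower-casing no longer trips over 'utf8towcs'
lower_words <- tolower(words_ascii)
print(lower_words)
```

One consequence of this design is that the hex codes stay in the text as tokens, so they can show up in the word cloud unless they are removed afterwards (for example via punctuation removal or a custom stop word list).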

Agile Bean
0

I ended up updating RStudio and my packages. This seemed to solve the tolower/content_transformer issues. I read somewhere that the last release of the tm package had some issues with tm_map, so maybe that was the problem. In any case, this worked!

Anne Boysen
0

Instead of

corp <- tm_map(corp, content_transformer(tolower), mc.cores=1)

use

corp <- tm_map(corp, tolower, mc.cores=1)
Michael Davidson
0

I was using code similar to the above in a word cloud Shiny app. It ran fine on my own PC but failed on both Amazon AWS and shinyapps.io. I discovered that text containing accented characters, e.g. santé, was not uploaded correctly to the cloud as CSV files. My solution was to save the files as .txt files in UTF-8 using Notepad and to rewrite my code to read txt files instead of csv files. My R version was 3.2.1 and my RStudio version was 0.99.465.

0

Just to mention, I had the same problem in a different context (nothing to do with tm or Twitter). For me, the solution was iconv(x, "latin1", "UTF-8"), even though Encoding() told me it was already UTF-8.
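A minimal sketch of how that situation can arise, assuming a made-up string whose bytes are latin1 but whose declared encoding is UTF-8 (the variables `x` and `y` are hypothetical, and `validUTF8()` needs R 3.3.0 or later):

```r
# Hypothetical example: the latin1 byte for "é" (0xE9) in a string that is
# wrongly declared to be UTF-8
x <- "sant\xe9"
Encoding(x) <- "UTF-8"

print(Encoding(x))   # the declared encoding says "UTF-8" ...
print(validUTF8(x))  # ... but the bytes are not valid UTF-8

# Re-interpret the bytes as latin1 and convert them to genuine UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")
print(validUTF8(y))
print(tolower(y))
```

This only helps when the underlying bytes really are latin1. If they are already valid UTF-8, the same call would garble them instead.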

Oliver