
I think the tm package is great: it has so many functions that make NLP simpler to implement. However, I am new to this and have run into a roadblock. Could someone help?

sms_clean <- tm_map(sms_corpus, content_transformer(tolower))

It gave me the following error message: Error in FUN(content(x), ...) : invalid input 'FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, �1.50 to rcv' in 'utf8towcs'

I believe the problem is caused by special characters or emojis that are not valid UTF-8. So I tried specifying the encoding when importing the file, an approach I found here.

sms_raw <- read.csv('spam.csv', 
                encoding = 'Latin-1', 
                stringsAsFactors = FALSE)

It gave me this error message: Error in FUN(content(x), ...) : invalid multibyte string 1

I also tried the following, which I found in another Stack Overflow post:

usableText <- str_replace_all(sms_corpus,"[^[:graph:]]", " ") 

and dataset <- iconv(sms_corpus, 'UTF-8', 'ASCII'), but neither one helped.

  • If the encoding isn't UTF-8 or Latin-1, do you know what the encoding is? Where did you get the data from? Perhaps you can ask them what the encoding should be. It's hard to say more without some sort of [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – MrFlick Oct 27 '21 at 02:36
  • Thanks for responding. I found the dataset at https://www.kaggle.com/uciml/sms-spam-collection-dataset. I don't know what the encoding is, though. Do you know of another dataset I can play with that's easier to handle, say in terms of encoding? – Sophia Hart Oct 28 '21 at 16:51
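
One way to test the encoding theory without the full dataset is sketched below in base R. It assumes the file is Latin-1, which the attempt above suggests but which nobody in the thread has confirmed. The vector `sms_text` is a hypothetical stand-in for the text column of `sms_raw`, and the byte `\xa3` (the pound sign in Latin-1) is an assumed example of a character that would trip `tolower()`. The idea is to convert the raw strings to UTF-8 with `iconv()` before building the corpus, and to check the result with `validUTF8()`.

```r
# Hypothetical stand-in for the SMS text column of a Latin-1 file.
# "\xa3" is the pound sign in Latin-1; on its own it is not valid UTF-8.
sms_text <- c("std chgs to send, \xa31.50 to rcv", "Plain ASCII Text")

# Check which strings are valid UTF-8 before any conversion.
valid_before <- validUTF8(sms_text)

# Declare the assumed source encoding and convert to UTF-8.
sms_utf8 <- iconv(sms_text, from = "latin1", to = "UTF-8")

# After conversion every string should be valid UTF-8.
valid_after <- validUTF8(sms_utf8)

# tolower() can now process the converted strings.
sms_lower <- tolower(sms_utf8)

print(valid_before)
print(valid_after)
```

If this holds for the real file, the same conversion could be applied to the text column of `sms_raw` before the corpus is created. Note also that, per the `read.table` documentation, the `encoding` argument of `read.csv` only marks strings as Latin-1 or UTF-8, while `fileEncoding` is the argument that actually re-encodes the input, so `fileEncoding = 'latin1'` may be worth trying at import time.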
