I am using R in RStudio and running the following code to perform a sentiment analysis on a set of unstructured texts. Because the texts contain some invalid characters (caused by emoticons and typos), I want to remove them before proceeding with the analysis.

An extract of my R code:

setwd("E:/sentiment")

doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)

# replace specific characters in doc1
  doc1<-gsub("[^\x01-\x7F]", "", doc1)

library(tm)

#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))

I get the following error message when I reach the line `corpus<- iconv(doc1$Review.Text, to = 'utf-8')`:

Error in doc1$Review.Text : $ operator is invalid for atomic vectors

I had a look at the following StackOverflow questions:

remove emoticons in R using tm package

Replace specific characters within strings

I have also tried the following to clean my texts before running the tm package, but I am getting the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")

How can I solve this issue?

user3115933
  • With `doc1<-gsub("[^\x01-\x7F]", "", doc1)` you overwrite `doc1`; from then on it is no longer a data frame but a character vector. Try `doc1<-gsub("[^\x01-\x7F]", "", iris); str(doc1)`. – jogo Jun 04 '19 at 09:07
  • `doc1` is a `data.frame`, and I guess that you want to apply `gsub` to the columns of `doc1`. If you apply `gsub` (which expects a character vector) directly to `doc1`, it gets coerced to a character vector, hence the error. – nicola Jun 04 '19 at 09:09
  • Getting your point. I guess then it should be doc1$Review.Text – user3115933 Jun 04 '19 at 09:10
  • @jogo Thanks. Please elaborate as an answer and I vote accordingly. – user3115933 Jun 04 '19 at 09:12

1 Answer

With 

doc1<-gsub("[^\x01-\x7F]", "", doc1)

you overwrite the object doc1; from then on it is no longer a data frame but a character vector. See:

doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)

Now it is clear why

doc1$Species

produces the error.
You probably want to do this instead:

doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
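
As a minimal sketch of the difference, the snippet below uses a small in-memory data frame as a stand-in for `book1.csv`. The sample strings and the name `flattened` are made up for illustration; only the column name `Review.Text` and the regular expression come from the question.

```r
# Hypothetical stand-in for book1.csv (assumed data, not from the question)
doc1 <- data.frame(
  Review.Text = c("Great book \U0001F600", "na\u00EFve plot", "plain ascii"),
  stringsAsFactors = FALSE
)

# gsub() on the whole data frame coerces it with as.character(),
# so the result is a character vector with one element per column.
flattened <- gsub("[^\x01-\x7F]", "", doc1)
is.data.frame(flattened)   # FALSE
is.character(flattened)    # TRUE

# Assigning into the column cleans the text and keeps the data frame,
# so doc1$Review.Text is still valid afterwards.
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
is.data.frame(doc1)        # TRUE
```

After the column-wise assignment, `iconv(doc1$Review.Text, to = 'utf-8')` from the question no longer fails on the `$` operator, because `doc1` is still a data frame.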
jogo
  • If I understand correctly, I should run gsub on my target column in the data frame. In this case it is the column Review.Text, so my code should look like this: doc1<-gsub("[^\x01-\x7F]", "", doc1$Review.Text) – user3115933 Jun 04 '19 at 09:22
  • No, `doc1 <- ... whatever` also overwrites your data frame `doc1`. See my edited answer. – jogo Jun 04 '19 at 09:27
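
The point of the last two comments can be sketched as follows. The data frame `reviews`, its `Title` and `Rating` columns, and the helper `clean_ascii` are hypothetical names introduced for illustration only. The last step, cleaning every character column at once, is one possible extension and not something the answer prescribes.

```r
# Hypothetical helper wrapping the gsub() call from the answer
clean_ascii <- function(x) gsub("[^\x01-\x7F]", "", x)

# Made-up example data; only Review.Text is a name from the question
reviews <- data.frame(
  Review.Text = c("good \u2764", "bad"),
  Title = c("caf\u00E9", "plain"),
  Rating = c(5L, 1L),
  stringsAsFactors = FALSE
)

# Assigning the result to a plain name yields a character vector;
# doing this with the data frame's own name would replace the data frame.
overwritten <- clean_ascii(reviews$Review.Text)
is.data.frame(overwritten)  # FALSE

# Assigning into the columns keeps the data frame intact.
# Here every character column is cleaned; numeric columns are left alone.
is_text <- vapply(reviews, is.character, logical(1))
reviews[is_text] <- lapply(reviews[is_text], clean_ascii)
str(reviews)
```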