How can I solve this R error message relating to atomic vectors?

Question

I am using R in RStudio and am running the following code to perform a sentiment analysis on a set of unstructured texts. Since the texts contain some invalid characters (caused by emoticons and typos), I want to remove them before proceeding with the analysis.

An extract of my R code follows:

setwd("E:/sentiment")

doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)

# replace specific characters in doc1
  doc1<-gsub("[^\x01-\x7F]", "", doc1)

library(tm)

#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))

I get the following error message when I reach this line of code corpus<- iconv(doc1$Review.Text, to = 'utf-8'):

Error in doc1$Review.Text : $ operator is invalid for atomic vectors

I had a look at the following StackOverflow questions:

remove emoticons in R using tm package

Replace specific characters within strings

I have also tried the following to clean my texts before running the tm package, but I am getting the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")

How can I solve this issue?

With `doc1<-gsub("[^\x01-\x7F]", "", doc1)` you overwrite `doc1`; from then on it is not a data frame but a character vector. Try `doc1<-gsub("[^\x01-\x7F]", "", iris); str(doc1)` — jogo, Jun 04 '19 at 09:07
`doc1` is a `data.frame` and I guess that you want to apply `gsub` to the columns of `doc1`. If you apply `gsub` (which expects a character vector) directly to `doc1`, it gets coerced to a character vector, which is what causes the error. — nicola, Jun 04 '19 at 09:09
Getting your point. I guess then it should be doc1$Review.Text — user3115933, Jun 04 '19 at 09:10
@jogo Thanks. Please elaborate as an answer and I vote accordingly. — user3115933, Jun 04 '19 at 09:12

jogo · Accepted Answer · 2019-06-04T09:24:13.450

With

doc1<-gsub("[^\x01-\x7F]", "", doc1)

you overwrite the object doc1; from then on it is no longer a data frame but a character vector. To see this, run:

doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)

and it becomes clear why

doc1$Species

produces the error.
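
The coercion can be reproduced without the original CSV. The following is a minimal sketch using a small, hypothetical data frame (`reviews` and its contents are made up for illustration); it shows that `gsub` returns a plain character vector with one element per column, and that `$` then fails on it.

```r
# Hypothetical toy data standing in for book1.csv (an assumption, not the real file)
reviews <- data.frame(
  Review.Text = c("caf\u00e9 was great", "good"),
  Rating = c(5, 4),
  stringsAsFactors = FALSE
)

# gsub() expects a character vector, so the whole data frame is coerced
flat <- gsub("[^\x01-\x7F]", "", reviews)

is.data.frame(flat)  # FALSE
is.character(flat)   # TRUE
length(flat)         # 2: one deparsed string per column

# `$` on an atomic vector raises the error from the question;
# tryCatch() captures the message instead of stopping the script
msg <- tryCatch(flat$Review.Text, error = function(e) conditionMessage(e))
print(msg)
```
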
What you probably want is:

doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
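
As a rough check of this fix, here is a small sketch on hypothetical data (the three review strings are invented for illustration; only the column name `Review.Text` comes from the question). Assigning the result back to the column, rather than to `doc1`, keeps the data frame intact.

```r
# Hypothetical stand-in for read.csv("book1.csv", stringsAsFactors = FALSE)
doc1 <- data.frame(
  Review.Text = c("Loved it \u263a", "caf\u00e9 vibes", "plain text"),
  stringsAsFactors = FALSE
)

# Clean only the text column; doc1 itself stays a data frame
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)

is.data.frame(doc1)  # TRUE
doc1$Review.Text     # `$` works, because doc1 is still a data frame

# Verify that no non-ASCII characters remain in the column
has_non_ascii <- any(grepl("[^\x01-\x7F]", doc1$Review.Text))
has_non_ascii        # FALSE
```

If the cleaned text is only needed for the corpus, storing it under a different name (for example `clean_text <- gsub(...)`) would also avoid overwriting `doc1`.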

edited Jun 04 '19 at 09:24

answered Jun 04 '19 at 09:18


If I understand correctly, I should run gsub on my target column in the data frame. In this case it is the column Review.Text. So my code should look like this: doc1<-gsub("[^\x01-\x7F]", "", doc1$Review.Text) – user3115933 Jun 04 '19 at 09:22
No, with `doc1 <- ... whatever` you are again overwriting your data frame `doc1`. See my edited answer. – jogo Jun 04 '19 at 09:27
