I am new to R and am analysing a review dataset. The dataset contains some non-breaking space labels (\u00A0), and I managed to find a solution that replaces them with gsub.
But after replacing them, when I tried to compute term frequencies, the most frequent terms turned out to be numbers. When I checked str() of the processed dataset, it produced the following:
> str(full)
'data.frame': 10000 obs. of 1 variable:
$ reviewContent: Factor w/ 9884 levels "\"ARS?!\" -- me when hearing"| __truncated__,..: 1941 9580 9393 1938 7192 885 3758 7201 2530 7445 ...
Here is my code:
library(tm)
text <- subset(full, select = reviewContent)
text <- as.data.frame(lapply(text, function(x) {gsub("\u00A0", " ", x)}))
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
t <- TermDocumentMatrix(corpus)
t <- data.matrix(t)
t <- sort(rowSums(t),decreasing=TRUE)
t <- data.frame(word = names(t),freq=t)
head(t, 10)
and the resulting term frequencies are:
word freq
1084 1084 2
1110 1110 2
113 113 2
1203 1203 2
1255 1255 2
140 140 2
1409 1409 2
1541 1541 2
1593 1593 2
1623 1623 2
I would really appreciate it if anyone could help me solve this problem.
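
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the str() output shows that reviewContent is a factor, and a factor is stored internally as integer codes. If those codes rather than the review text reach the corpus, the "terms" would be numbers. The base R sketch below only illustrates that factor behaviour and a character-first cleanup. The data frame full_demo and its contents are hypothetical stand-ins for the real data.

```r
# Hypothetical toy data standing in for the `full` data frame in the question
full_demo <- data.frame(
  reviewContent = factor(c("good\u00A0food", "bad service", "good\u00A0food"))
)

# A factor is stored as integer codes that index into its levels,
# so coercing it to integer yields numbers instead of the review text
codes <- as.integer(full_demo$reviewContent)
print(codes)

# Convert to character first, then replace the non-breaking spaces
clean_text <- gsub("\u00A0", " ", as.character(full_demo$reviewContent))
print(clean_text)

# clean_text is a plain character vector with one element per review
stopifnot(is.character(clean_text), length(clean_text) == nrow(full_demo))
```

If this is indeed the cause, passing a character vector such as clean_text to VectorSource(), instead of the whole data frame, may yield word terms rather than numbers.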