
I am new to R and am analysing a review dataset. There are some labels in the dataset, and I managed to find a solution to replace them with gsub.

But after replacing them, when I tried to compute term frequencies, the most frequent terms turned out to be numbers. When I checked str() of the processed dataset, it produced the following:

> str(full)
'data.frame':   10000 obs. of  1 variable:
 $ reviewContent: Factor w/ 9884 levels "\"ARS?!\" -- me when hearing"| __truncated__,..: 1941 9580 9393 1938 7192 885 3758 7201 2530 7445 ...

Here is my code:

text <- subset(full, select = reviewContent) 
text <- as.data.frame(lapply(text, function(x) {gsub("\u00A0", " ", x)}))
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
t <- TermDocumentMatrix(corpus)
t <- data.matrix(t)
t <- sort(rowSums(t), decreasing = TRUE)
t <- data.frame(word = names(t), freq = t)
head(t, 10)

The resulting term frequencies are:
      word freq
  1084 1084    2
  1110 1110    2
  113   113    2
  1203 1203    2
  1255 1255    2
  140   140    2
  1409 1409    2
  1541 1541    2
  1593 1593    2
  1623 1623    2

I would really appreciate it if anyone could help solve this problem.

Nniicckk
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. But you probably want `as.data.frame(lapply(text, function(x) {gsub("\u00A0", " ", x)}), stringsAsFactors=FALSE)`. – MrFlick Jun 14 '18 at 19:35
  • Thanks for the advice on a reproducible example, but this is a review dataset obtained from another party, so I can't post it here. And there is no similar review dataset in the R library. By the way, I've tried lapply() before but it returned the same result. Anyway, thanks for the advice. – Nniicckk Jun 14 '18 at 19:45
  • What is the output of `str(t)` after the line `t <- data.matrix(t)`? – LucyMLi Jun 14 '18 at 19:54
  • Just to be sure: try `gsub("(*UCP)\\x{00A0}", " ", x, perl=TRUE)` – Wiktor Stribiżew Jun 14 '18 at 19:59
  • You could try adding the parameter `stringsAsFactors = FALSE` to `as.data.frame`. – Ian Wesley Jun 14 '18 at 21:47
  • Hey guys, thanks for all the advice, but I have found the solution to the problem. Basically, the numbers showing up in the term frequency are column numbers (I guess, because I tried exporting to CSV and found no similar numbers except the column numbers). The solution is quite simple: use as_data_frame() instead of as.data.frame() to convert the usual df into a tbl_df. Hope it helps anyone else who encounters a similar problem. Thanks again. – Nniicckk Jun 15 '18 at 10:39
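
The suggestions in the comments can be illustrated with a small base R sketch. It rests on an assumption the thread does not confirm: that the numeric "terms" are the integer codes of the factor column shown in the str() output. The toy `full` data frame and the names `clean`, `as_factor`, `as_text` and `codes` are hypothetical stand-ins, not the original review data. `stringsAsFactors` is set explicitly in both calls because its default changed from TRUE to FALSE in R 4.0.0.

```r
# Toy stand-in for the review column (hypothetical data, not the original dataset).
full <- data.frame(
  reviewContent = c("great\u00A0food", "slow service", "great\u00A0view"),
  stringsAsFactors = FALSE
)

# Replace non-breaking spaces on the character vector itself.
clean <- gsub("\u00A0", " ", full$reviewContent, fixed = TRUE)

# Rebuilding a data frame with stringsAsFactors = TRUE (the default before
# R 4.0.0) stores the text as a factor: integer codes plus a table of levels.
as_factor <- as.data.frame(list(reviewContent = clean), stringsAsFactors = TRUE)
codes <- as.integer(as_factor$reviewContent)  # numbers, not words

# With stringsAsFactors = FALSE the column stays plain character text.
as_text <- as.data.frame(list(reviewContent = clean), stringsAsFactors = FALSE)

print(is.factor(as_factor$reviewContent))
print(codes)
print(is.character(as_text$reviewContent))
```

If that assumption holds, a related thing to try is passing the character vector itself, for example `VectorSource(as_text$reviewContent)`, rather than the whole data frame, since tm's `VectorSource()` is documented as taking a vector of texts.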
