I have the following code, but I get an error when trying to create a document-term matrix. (Originally I had the data in a CSV file with one column and loaded it with read.csv, but for the purposes of replication I create a data frame below.)

library(tm)
TEXTS<- as.data.frame(c("I am a cat person", "I like both cats and dogs"), stringsAsFactors = FALSE)
docs<-VCorpus(VectorSource(TEXTS))
docs <- tm_map(docs, removePunctuation) 
docs <- tm_map(docs, removeNumbers) 
docs <- tm_map(docs, content_transformer(tolower), lazy = TRUE)   
docs <- tm_map(docs, PlainTextDocument, lazy = TRUE) 
docs <- tm_map(docs, removeWords, stopwords("english"), lazy = TRUE)  
library(SnowballC)   
docs <- tm_map(docs, stemDocument, language = meta(docs, "english"), lazy = TRUE) 
dtm <- DocumentTermMatrix(docs) 

This is the error I get from the last line:

Error in stemDocument.PlainTextDocument(x, ...) : 
  promise already under evaluation: recursive default argument reference or earlier problems?
In addition: Warning message:
In stemDocument.PlainTextDocument(x, ...) :
  restarting interrupted promise evaluation

What can I do? Thanks.

Deb Martin

1 Answer

Why were you calling the PlainTextDocument function? I removed it, and I also removed the meta reference from the language argument of the stemming step.

I've reordered your code. Remember that if you repeatedly call functions whose first argument is the result of the previous call, you can chain them with the pipe operator %>% (it comes from the magrittr package and is re-exported by dplyr) to make your code read more smoothly (https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html)

library(tm)
library(SnowballC)
library(dplyr) #install it if you don't have this package   

TEXTS<- as.data.frame(c("I am a cat person", "I like both cats and dogs"), stringsAsFactors = FALSE)
docs<-VCorpus(VectorSource(TEXTS))
docs <- tm_map(docs, removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower), lazy = TRUE) %>%
  tm_map(removeWords, stopwords("english"), lazy = TRUE) %>%
  tm_map(stemDocument, language = c("english"), lazy = TRUE) 
dtm <- DocumentTermMatrix(docs)
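
One thing to watch in the code above: a data frame is a list of columns, so its length is the number of columns, not the number of rows. As far as I can tell, VectorSource takes its document count from that length, so passing the character column itself is the safer choice. A minimal sketch, assuming the column is named `text` (a name made up for this example):

```r
library(tm)

# Same two example sentences as above, stored in a one-column data frame.
# The column name "text" is an assumption made for this sketch.
TEXTS <- data.frame(text = c("I am a cat person", "I like both cats and dogs"),
                    stringsAsFactors = FALSE)

# A data frame is a list of columns, so its length is the column count.
length(TEXTS)        # 1
length(TEXTS$text)   # 2

# Passing the character vector gives one document per row.
docs <- VCorpus(VectorSource(TEXTS$text))
length(docs)         # 2
```

The rest of the tm_map pipeline can then be applied to `docs` unchanged.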
tia_0
  • Thanks. If I have my data frame in a CSV file instead of how I displayed it above, could I do: `texts<-as.data.frame(read.csv("descriptions.csv", header = TRUE, stringsAsFactors=FALSE))` and `docs<-VCorpus(VectorSource(texts))`? When I use the same code you had above, I get this error after `dtm<-DocumentTermMatrix(docs)`: `Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), : input string 715 is invalid UTF-8` – Deb Martin Aug 02 '16 at 20:10
  • When you read the CSV file you're casting it to a data frame, and I suppose the data frame has multiple columns. You have to pass only the variable containing the text to the `VCorpus` function; maybe the error is triggered by the incorrect input. If you can share an extract of the CSV, I can try it too. – tia_0 Aug 02 '16 at 20:16
  • My data frame only has one column. However, I tried using `texts$Long_Description` and it still does not work. I'm not sure how I can show you the CSV file, but it has one column entitled "Long_Description". These are the first few rows (a comma indicates a new row): I am a cat person, I like both cats and dogs, I hate pets, I don't have any pets, I only like small animals, I am a dog person; but I also have a pet rabbit – Deb Martin Aug 02 '16 at 20:24
  • I think the problem is related to the type of characters you have in the .csv file. I tried with the strings you gave me and it worked. [Example of the code running](http://imgur.com/a/FKyTo) – tia_0 Aug 02 '16 at 20:32
  • I went through my data and all the values are strings. Is there anything I can do in Excel to make sure that they are strings? Thanks so much. – Deb Martin Aug 02 '16 at 20:48
  • Maybe look at this [tutorial](https://www.datacamp.com/community/tutorials/r-data-import-tutorial#gs.tSvUJ1o) from DataCamp; it gives some advice on how to prepare the Excel file before importing it into R. Look for special symbols or other elements that can corrupt the text-mining operations. Maybe also look at this question: http://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs – tia_0 Aug 02 '16 at 20:54
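
For the "invalid UTF-8" error discussed in the comments, one possible approach is to find and repair the offending strings in R before building the corpus. The sketch below uses only base R; the sample strings are made up for illustration, and whether the real file is Latin-1 encoded is an assumption you would need to check.

```r
# Hypothetical sample: the second string contains a raw Latin-1 byte (0xE9),
# which is not valid UTF-8 on its own.
texts <- c("I am a cat person", "caf\xe9 owner with pets")

# Flag entries that are not valid UTF-8 (validUTF8 is in base R >= 3.3.0).
bad <- !validUTF8(texts)
which(bad)

# Option A: drop the bytes that cannot be interpreted as UTF-8.
cleaned <- iconv(texts, from = "UTF-8", to = "UTF-8", sub = "")

# Option B (assumption: the data is really Latin-1): re-encode instead.
recoded <- iconv(texts, from = "latin1", to = "UTF-8")
```

Either `cleaned` or `recoded` can then be passed to `VectorSource`. If the source file turns out to be Latin-1, passing `fileEncoding = "latin1"` to `read.csv` may avoid the problem at import time.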