I keep getting this error when importing a CSV file into R and building a corpus for topic modeling. I have used this approach successfully on 4 other projects but cannot get past this error. My data source has a doc_id column and a text column. The error is:

all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

I tried importing the data using a number of different suggestions, such as those in "Error faced while using TM package's VCorpus in R":

file_loc <- "C:\\Users\\mdlawrence\\Desktop\\Test2.csv "

x <- read.csv(file_loc, header = TRUE, stringsAsFactors = F)

require(tm)
Loading required package: tm
Loading required package: NLP

corp <- Corpus(DataframeSource(x))

Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

docs <- DocumentTermMatrix(corp)

Error in TermDocumentMatrix(x, control) : object 'corp' not found

I expect to see a corpus with one document per row in the .csv file. Any suggestions are greatly appreciated.

NelsonGon
  • Could you add a sample of `x` with `head(x)` or better a `dput`? – NelsonGon May 20 '19 at 10:51
  • As far as I remember, with the `tm` package the column names must be `doc_id` and `text`. What is the column name of the `text` part? Also try `VCorpus(DataframeSource(x))`. Check out this answer: https://stackoverflow.com/questions/47406555/error-faced-while-using-tm-packages-vcorpus-in-r which seems to describe the same problem you are facing. – user113156 May 20 '19 at 10:52
  • Possible duplicate of [Error faced while using TM package's VCorpus in R](https://stackoverflow.com/questions/47406555/error-faced-while-using-tm-packages-vcorpus-in-r) – NelsonGon May 20 '19 at 10:54
  • @NelsonGon structure(list(X.doc_id. = c("1A", "2A",… ), X.text. = c("I think a conversation needs to be had to bring all employee groups up to the same … 0 feet" )), .Names = c("X.doc_id.", "X.text."), class = "data.frame", row.names = c(NA, -100L)) – Matthew Lawrence May 20 '19 at 12:45
  • @NelsonGon also when I try the solution presented in the possible duplicate problem I get this error. Error in data.frame(doc_id = row.names(xdata), text = x$text) : arguments imply differing number of rows: 100, 0 – Matthew Lawrence May 20 '19 at 12:49
  • @user113156 the column name is text. I saw that post which is why I changed the column names. I am at a loss. Thank you for your input. – Matthew Lawrence May 20 '19 at 12:55
  • You can [edit] the question to include a `dput` of a representative sample of the data. One row isn't enough to test out your code on. [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R post folks can more easily help with. – camille May 20 '19 at 15:21

1 Answer

It's an issue with your column names. The dput you provided shows this as well: the columns are named X.doc_id. and X.text. rather than doc_id and text. Running the following produces the same error you were experiencing.

x <- structure(list(X.doc_id. = c("1A", "2A"), 
                    X.text. = c("I think a conversation needs to be had to bring all employee groups up to the same … 0 feet" )),
               .Names = c("X.doc_id.", "X.text."), class = "data.frame", row.names = c(NA, -10L))


library(tm)
VCorpus(DataframeSource(x))

Error in inherits(x, "Source") : all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
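
A quick way to confirm this before calling `DataframeSource()` is to compare the column names against the two names the error message asks for. This is only a minimal base R sketch; the two-row data frame and the name `x_check` are made up for illustration and merely mimic the column names from your `dput`:

```r
# Made-up data frame that mimics the column names from the dput
x_check <- data.frame(X.doc_id. = c("1A", "2A"),
                      X.text.   = c("first document", "second document"),
                      stringsAsFactors = FALSE)

# The two names checked in the error message
required <- c("doc_id", "text")

# Required names that are absent from the data frame
missing_cols <- setdiff(required, names(x_check))

names(x_check)   # what the data frame actually calls its columns
missing_cols     # non-empty here, so DataframeSource() would fail
```

If `missing_cols` comes back empty, the names are fine and the problem lies elsewhere.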

Running the following gets the correct result you are looking for.

colnames(x) <- c("doc_id", "text") 
library(tm)
Y <- VCorpus(DataframeSource(x))
Y

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 10
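
Note that assigning `colnames(x)` by position assumes the columns arrive in the order doc_id, text. If that is not guaranteed, one alternative is to strip the extra characters from the names instead. A sketch, assuming the mangled names always follow the `X.<name>.` pattern seen in the `dput` (the vector `mangled` is just an example, deliberately in the opposite order):

```r
# Example names in the mangled form, in reverse order on purpose
mangled <- c("X.text.", "X.doc_id.")

# Drop a leading "X." and a trailing "." from each name
cleaned <- sub("\\.$", "", sub("^X\\.", "", mangled))
cleaned

# On a real data frame this would be:
# names(x) <- sub("\\.$", "", sub("^X\\.", "", names(x)))
```

This keeps each column matched to its own name regardless of order.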

Running:

content(Y)
content(Y[[1]])

Gives:

content(Y[[1]])
[1] "I think a conversation needs to be had to bring all employee groups up to the same … 0 feet"

Creating the document-term matrix:

dtm <- DocumentTermMatrix(Y)
dtm

<<DocumentTermMatrix (documents: 10, terms: 11)>>
Non-/sparse entries: 11/99
Sparsity           : 90%
Maximal term length: 12
Weighting          : term frequency (tf)

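Somewhere in your code or while loading the data, the column names were altered. By default read.csv() passes the header through make.names() (check.names = TRUE), which prepends X to names that do not start with a letter and replaces invalid characters with dots. A header with stray characters around doc_id and text, for example unexpected quote marks, would come out as X.doc_id. and X.text., which matches your dput.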
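
To illustrate how names like X.doc_id. can arise, here is a small sketch. The CSV content is hypothetical (headers wrapped in single quotes, which `read.csv()` does not treat as quoting characters by default); your actual file may contain something different, so treat this as one possible cause rather than a diagnosis:

```r
# Hypothetical CSV content: headers wrapped in single quotes
csv_text <- "'doc_id','text'\n1A,first document\n2A,second document"

# Default behaviour: check.names = TRUE runs the header through make.names()
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(raw)    # mangled into the X.doc_id. / X.text. form

# check.names = FALSE keeps the header exactly as it appears in the file
kept <- read.csv(text = csv_text, stringsAsFactors = FALSE, check.names = FALSE)
names(kept)   # still carries the stray quote characters

# Remove the stray quotes so the names match what tm expects
names(kept) <- gsub("'", "", names(kept), fixed = TRUE)
names(kept)
```

Reading with `check.names = FALSE` and printing `names()` is a handy way to see what the header in the file really contains.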

user113156