How to remove error in term-document matrix in R?

Question

I am trying to create a term-document matrix in R from a corpus of files, but when I run the code I get the following error, followed by two warnings:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  : 
 'i, j' invalid
 Calls: DocumentTermMatrix ... TermDocumentMatrix.VCorpus ->    simple_triplet_matrix -> .Call
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
 scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow =   length(allTerms),  :
NAs introduced by coercion

My code is given below:

library(tm)
library(RWeka)
library(tmcn.word2vec)

#Reading data
data <- read.csv("Train.csv", header=T)
#text <- data$EventDescription

#Pre-processing
corpus <- Corpus(VectorSource(data$EventDescription))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
#dataframe <- data.frame(text=unlist(sapply(corpus,'[',"content")))

#Reading dictionary file
dict <- scan("dictionary.txt", what = 'character', sep = '\n')

#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4))
tdm_doc <- DocumentTermMatrix(corpus, control = list(stopwords = dict, tokenize = BigramTokenizer))
tdm_dic <- DocumentTermMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))

As suggested in other answers on SO, I have tried installing the SnowballC package and the other ideas listed there, but I still get the same error. Can anyone help me with this? Thanks in advance.

please post enough of the input files so that one can reproduce the error — pcantalupo, Sep 11 '15 at 12:37
For example post the value of `dput(head(data))`. But first try and see if you get the error when you use only the `head` of `data`. — asachet, Sep 11 '15 at 12:40
Looks like a parallel issue. check this [post](http://stackoverflow.com/questions/25069798/r-tm-in-mclapplycontentx-fun-all-scheduled-cores-encountered-errors) or this [post](http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka). — phiver, Sep 11 '15 at 12:43

score 18 · Answer 1 · answered Apr 22 '17 at 09:48

I had the same problem when building my DocumentTermMatrix, and I solved it by removing the following command:

corpus <- tm_map(corpus, PlainTextDocument)
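
For reference, here is a minimal sketch of the cleaning pipeline with that line left out. The sample texts are hypothetical stand-ins for `data$EventDescription`. Wrapping `tolower` in `content_transformer()` is not part of this answer; it is an additional, commonly recommended pattern in tm 0.6 and later and is included here as an assumption about what may also help.

```r
library(tm)

# Hypothetical stand-in for data$EventDescription from the question
texts <- c("Power outage reported, downtown area.",
           "Scheduled maintenance on the   main server.",
           "Network outage affected several users!")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
# Assumption: wrap base functions such as tolower in content_transformer()
# so that each element stays a tm text document
corpus <- tm_map(corpus, content_transformer(tolower))
# Note: no tm_map(corpus, PlainTextDocument) call here

dtm <- DocumentTermMatrix(corpus)
dim(dtm)  # number of documents x number of distinct terms
```

If you are working interactively, rebuild `corpus` from the raw data before rerunning, since a corpus already altered by the removed line stays altered.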

raha.rah

I tried this, that is I commented out the line and reran that section of code. It did not work. Then I cleared my data, and reran the script from the beginning. On this second run, this solution worked! – Mike Goldweber Oct 13 '18 at 18:31

score 13 · Answer 2 · answered Mar 22 '16 at 18:00

I had a similar error when cleaning a corpus. Some of the tm_map functions do not return a corpus, so I added the following line after the offending call, and that fixed it:

corpus <- Corpus(VectorSource(corpus))

For me the problem arose after stem completion. I would suggest trying to make a tdm after each tm_map call. That will tell you which cleaning step is causing the problem.
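
A rough sketch of that step-by-step check, assuming a small hypothetical set of documents and an illustrative list of cleaning steps (swap in your own): after each `tm_map` call it tries to build the matrix and records whether that succeeded.

```r
library(tm)

# Hypothetical sample documents
texts <- c("First report: power outage.", "Second report, network outage!")

# Illustrative cleaning steps, applied in this order
steps <- list(
  stripWhitespace   = stripWhitespace,
  removePunctuation = removePunctuation,
  tolower           = content_transformer(tolower)
)

corpus <- VCorpus(VectorSource(texts))
ok <- logical(0)
for (name in names(steps)) {
  corpus <- tm_map(corpus, steps[[name]])
  # Try to build the matrix; record FALSE if it fails after this step
  ok[name] <- tryCatch({
    DocumentTermMatrix(corpus)
    TRUE
  }, error = function(e) FALSE)
}
print(ok)
```

The first step flagged `FALSE` is the one to look at more closely.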

Best of luck!

emilliman5

I have tried to diagnose the tm_map which produces my problem in the way you say. It was this: corpus <- tm_map(corpus, PlainTextDocument) – lbcommer Apr 13 '17 at 11:08
