
I'm attempting to create a term-document matrix from a text file of roughly 3 million lines. I have created a random sample of the text, which results in about 300,000 lines.

Unfortunately, when I use the following code I end up with 300,000 documents. I just want one document with the frequencies for each bigram:

library(RWeka)
library(tm)

corpus <- readLines("myfile")
numberLinesCorpus <- 3000000
corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus * .1, replace = FALSE)]
myCorpus <- Corpus(VectorSource(corpus_sample))
# min = 1 means this also produces unigrams alongside the bigrams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))

The sample contains approximately 300,000 lines. However, the number of documents in tdm is also 300,000.

Any help would be much appreciated.

statsguyz

2 Answers


You'll need to use the paste function on your corpus_sample vector.

paste, with a value set for collapse, takes a vector with many text elements and converts it into a vector with one text element, where the original elements are separated by the string you specify.

text <- c('a', 'b', 'c')
text <- paste(text, collapse = " ")
text
# [1] "a b c"
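
Applied to the question's code, a minimal sketch (reusing the corpus_sample vector and BigramTokenizer from the question) would collapse the sample into one string before building the corpus:

corpus_sample <- paste(corpus_sample, collapse = " ")  # one long string
myCorpus <- Corpus(VectorSource(corpus_sample))        # now a single document
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))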
Mhairi McNeill
  • Now I have the following error message: Error in ls(envir = envir, all.names = private) : invalid 'envir' argument Error during wrapup: cannot open the connection – statsguyz Jul 15 '15 at 11:59
  • Hm, error in ls? That seems strange. Where exactly are you getting the error message? If you could rewrite the question with some data to make it reproducible it would be easier to help fix this. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Mhairi McNeill Jul 15 '15 at 12:59
  • I was able to create the DocumentTermMatrix without the tokenizer = Bigram. – statsguyz Jul 15 '15 at 13:08

You can also use the quanteda package, as an alternative to tm. That will do what you want in the following steps, after you've created corpus_sample:

require(quanteda)
myDfm <- dfm(corpus_sample, ngrams = 2)
bigramTotals <- colSums(myDfm)

I also suspect it will be faster.
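
If you then want to see the most frequent bigrams, a minimal follow-up (base R only, using the bigramTotals vector above) is:

head(sort(bigramTotals, decreasing = TRUE), 10)  # top 10 bigrams by count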

Ken Benoit