I'm attempting to create a term-document matrix from a text file of roughly 3 million lines. To keep things manageable, I take a random 10% sample, which leaves about 300,000 lines.
Unfortunately, when I use the following code I end up with 300,000 documents. I just want one document with the frequency of each bigram:
library(RWeka)
library(tm)
corpus <- readLines("myfile")  # one element per line of the file
numberLinesCorpus <- 3000000
# draw a 10% sample of the lines without replacement
corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus * .1, replace = FALSE)]
myCorpus <- Corpus(VectorSource(corpus_sample))
# tokenize into unigrams and bigrams (min = 1, max = 2)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))
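For reference, the document count can be checked with dim(), which for a TermDocumentMatrix returns the number of terms and the number of documents:

dim(tdm)  # second element is the number of documents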
The sample contains approximately 300,000 lines, and the number of documents in tdm is also 300,000: one document per line, rather than the single document I'm after.
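My guess is that I need to collapse the sampled lines into a single string before building the corpus, something like the sketch below, but I haven't confirmed this is the idiomatic fix:

# collapse every sampled line into one string so VectorSource yields a single document
corpus_one_doc <- paste(corpus_sample, collapse = " ")
myCorpus <- Corpus(VectorSource(corpus_one_doc))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))

One thing I'm unsure about with this approach is that it would also create bigrams that span the original line breaks.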
Any help would be much appreciated.