
I am trying to do bigram tokenization on a CSV file, but it is taking a lot of time. I have checked my code against existing answers on SO and couldn't find any fault in it. My code is shown below:

library(tm)
library(RWeka)
library(tmcn.word2vec)
library(openNLP)
library(NLP)

data <- read.csv("Train.csv", header=T)

corpus <- Corpus(VectorSource(data$EventDescription))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

Can anyone help me solve this problem? Thanks in advance.

Athira
  • I'm not clear on the problem here. Are you saying this works but takes a long time? How long does it take? How big is your corpus? – jlhoward Sep 09 '15 at 08:58
  • I have 1236 documents in the corpus, but the code isn't working for that much data. – Athira Sep 09 '15 at 09:03

1 Answer


Consider reading the sources into a VCorpus instead of a Corpus. See:

  • Document-term matrix in R - bigram tokenizer not working
  • Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus
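
For reference, here is a minimal sketch of that change applied to the code in the question (assuming the same Train.csv file and EventDescription column). With a VCorpus, the RWeka tokenizer passed in the control list is used when building the document-term matrix; the plain Corpus() constructor may build a SimpleCorpus in newer versions of tm, whose DocumentTermMatrix method ignores custom tokenizers. The PlainTextDocument step is also dropped, since a VCorpus built from a VectorSource already stores PlainTextDocuments.

library(tm)
library(RWeka)

data <- read.csv("Train.csv", header = TRUE, stringsAsFactors = FALSE)

# Build a volatile in-memory corpus so the custom tokenizer below is respected
corpus <- VCorpus(VectorSource(data$EventDescription))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Pure bigrams: min = max = 2
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Quick check that bigram terms appear as columns
inspect(dtm)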

hongsy