I am trying to use tm's DocumentTermMatrix function to produce a matrix of bigrams instead of unigrams. I have tried the approaches outlined here and here; these are the three versions of my function:
# Version 1: bigram tokenizer built on tau::textcnt
library(tm)
library(tau)
make_dtm = function(main_df, stem=FALSE){
  # Tokenizer: the names of the n-gram counts that textcnt computes for x
  tokenize_ngrams = function(x, n=2) return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n)))))
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=tokenize_ngrams,
                                                               stopwords=TRUE,
                                                               tolower=TRUE,
                                                               removeNumbers=TRUE,
                                                               removePunctuation=TRUE,
                                                               stemming = stem))
  return(decisions.dtm)
}
# Version 2: bigram tokenizer built on RWeka::NGramTokenizer
library(tm)
library(RWeka)
make_dtm = function(main_df, stem=FALSE){
  # Tokenizer: n-grams of exactly two words
  BigramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=BigramTokenizer,
                                                               stopwords=TRUE,
                                                               tolower=TRUE,
                                                               removeNumbers=TRUE,
                                                               removePunctuation=TRUE,
                                                               stemming = stem))
  return(decisions.dtm)
}
# Version 3: bigram tokenizer built on NLP::ngrams (NLP is attached along with tm)
library(tm)
make_dtm = function(main_df, stem=FALSE){
  # Tokenizer: paste each pair of adjacent words into one term
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=BigramTokenizer,
                                                               stopwords=TRUE,
                                                               tolower=TRUE,
                                                               removeNumbers=TRUE,
                                                               removePunctuation=TRUE,
                                                               stemming = stem))
  return(decisions.dtm)
}
Unfortunately, all three versions of the function produce exactly the same output: a DTM with unigrams rather than bigrams (image included for simplicity):
For your convenience, here is a subset of the data that I am working with:
x = data.frame("CaseName" = c("Attorney General's Reference (No.23 of 2011)", "Attorney General's Reference (No.31 of 2016)", "Joseph Hill & Co Solicitors, Re"),
               "CaseID"= c("[2011]EWCACrim1496", "[2016]EWCACrim1386", "[2013]EWCACrim775"),
               "CaseTranscriptText" = c("sanchez 2011 02187 6 appeal criminal division 8 2011 2011 ewca crim 14962011 wl 844075 wales wednesday 8 2011 attorney general reference 23 2011 36 criminal act 1988 representation qc general qc appeared behalf attorney general",
                                        "attorney general reference 31 2016 201601021 2 appeal criminal division 20 2016 2016 ewca crim 13862016 wl 05335394 dbe honour qc sitting cacd wednesday 20 th 2016 reference attorney general 36 criminal act 1988 representation",
                                        "matter wasted costs against company solicitors 201205544 5 appeal criminal division 21 2013 2013 ewca crim 7752013 wl 2110641 date 21 05 2013 appeal honour pawlak 20111354 hearing date 13 th 2013 representation toole respondent qc appellants"))
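One thing that may be worth checking, offered as a hedged sketch rather than a confirmed diagnosis: in tm 0.7 and later, Corpus(VectorSource(...)) returns a SimpleCorpus, and as far as I know DocumentTermMatrix ignores custom tokenizer functions for that class. The snippet below assumes tm 0.7 or later and builds the corpus with VCorpus instead, using the NLP-based tokenizer from the third version. The two short strings in texts are made-up stand-ins assembled from words in the sample data, not the real transcripts.

```r
library(tm)  # assumption: tm >= 0.7; attaching tm also attaches NLP (ngrams, words)

# Hypothetical stand-in documents, built from words in the sample data
texts <- c("attorney general reference criminal act",
           "appeal criminal division attorney general")

# Bigram tokenizer: paste each pair of adjacent words into one term
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

# VCorpus instead of Corpus, so the documents are not stored as a SimpleCorpus
decisions <- VCorpus(VectorSource(texts))
class(decisions)

# Same call as in the question, reduced to the tokenizer option
dtm <- DocumentTermMatrix(decisions, control = list(tokenize = BigramTokenizer))

# Inspect the terms to see whether they are now two-word strings
Terms(dtm)
```

If the terms come out as bigrams here but not with Corpus, the corpus class is the likely cause, and the other control options (stopwords, stemming and so on) can be added back one at a time to see how they interact with two-word tokens.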