I am trying to generate a list of all unigrams, bigrams, and trigrams in R so that I can eventually build a document-phrase matrix whose columns cover every single word, bigram, and trigram.
I expected to find an easy package for this but have not succeeded. I was eventually pointed to RWeka (code and output below), but unfortunately this approach drops every unigram that is only one or two characters long.
Can this be repaired, or does anyone know of another approach? Thanks!
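library(tm)     # Corpus, VectorSource, TermDocumentMatrix, inspect
library(RWeka)  # NGramTokenizer, Weka_control

# Tokenizer that emits every unigram, bigram, and trigram
TrigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 3))
}

Text <- c("Ab Hello world", "Hello ab", "ab")
tt <- Corpus(VectorSource(Text))
tdm <- TermDocumentMatrix(tt, control = list(tokenize = TrigramTokenizer))
inspect(tdm)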
# <<TermDocumentMatrix (terms: 6, documents: 3)>>
# Non-/sparse entries: 7/11
# Sparsity           : 61%
# Maximal term length: 14
# Weighting          : term frequency (tf)
#                 Docs
# Terms            1 2 3
#   ab hello       1 0 0
#   ab hello world 1 0 0
#   hello          1 1 0
#   hello ab       0 1 0
#   hello world    1 0 0
#   world          1 0 0
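One possible fix, sketched under the assumption that the short terms are filtered by tm rather than by RWeka: as far as I can tell, the `wordLengths` control option of `TermDocumentMatrix` defaults to `c(3, Inf)`, which would explain why "ab" disappears while "ab hello" survives. The sketch also uses `VCorpus` instead of `Corpus`, on the assumption that recent tm versions may ignore a custom tokenizer on the simple corpus that `Corpus()` returns. The name `tdm_all` is just illustrative.

```r
library(tm)
library(RWeka)

# Same tokenizer as in the question: unigrams through trigrams
TrigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 3))
}

Text <- c("Ab Hello world", "Hello ab", "ab")

# Assumption: VCorpus is needed on recent tm versions so that the
# custom tokenizer is actually applied
tt <- VCorpus(VectorSource(Text))

# Assumption: wordLengths = c(1, Inf) lifts tm's default minimum term
# length of 3 characters, so one- and two-character terms are kept
tdm_all <- TermDocumentMatrix(
  tt,
  control = list(tokenize = TrigramTokenizer, wordLengths = c(1, Inf))
)

inspect(tdm_all)
```

If the assumption holds, "ab" should now show up as its own row in all three documents.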
Here is a version of ngram() from below, edited to be more efficient (I think). The idea is to reuse the n-gram strings built on the previous pass: each pass appends one more token to them, which avoids the double loop when include.all = TRUE.
ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {
  M <- length(tokens)
  stopifnot( n > 0 )

  # return NULL if there are no tokens, or if include.all=FALSE and the
  # document is too short to contain any n-gram of width n
  if ( ( M == 0 ) || ( !include.all && M < n ) ) {
    return( c() )
  }

  # bail if just want original tokens or if we only have one token
  if ( (n == 1) || (M == 1) ) {
    return( tokens )
  }

  # cap the n-gram width at the number of tokens
  end <- min( M-1, n-1 )

  all_ngrams <- c()
  toks <- tokens
  for (width in 1:end) {
    # toks currently holds the n-grams of size `width`
    if ( include.all ) {
      all_ngrams <- c( all_ngrams, toks )
    }
    # extend each n-gram by the token that follows it
    toks <- paste( toks[1:(M-width)], tokens[(1+width):M], sep=concatenator )
  }
  all_ngrams <- c( all_ngrams, toks )
  all_ngrams
}
ngram( c("A","B","C","D"), n=3, include.all=TRUE )   # unigrams, bigrams, and trigrams
ngram( c("A","B","C","D"), n=3, include.all=FALSE )  # "A_B_C" "B_C_D"
ngram( c("A","B","C","D"), n=10, include.all=FALSE ) # NULL: fewer than n tokens
ngram( c("A","B","C","D"), n=10, include.all=TRUE )  # everything up to "A_B_C_D"
# edge cases
ngram( c(), n=3, include.all=TRUE )              # NULL: no tokens
ngram( "A", n=0, include.all=TRUE )              # error: fails stopifnot( n > 0 )
ngram( "A", n=3, include.all=TRUE )              # "A"
ngram( "A", n=3, include.all=FALSE )             # NULL: fewer than n tokens
ngram( "A", n=1, include.all=FALSE )             # "A"
ngram( "A", n=1, include.all=TRUE )              # "A"
ngram( c("A","B"), n=1, include.all=FALSE )      # "A" "B"
ngram( c("A","B"), n=1, include.all=TRUE )       # "A" "B"
ngram( c("A","B","C"), n=1, include.all=FALSE )  # "A" "B" "C"
ngram( c("A","B","C"), n=1, include.all=TRUE )   # "A" "B" "C"
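To get from per-document n-grams to the document-phrase matrix I was originally after, something like the following base-R sketch might work. It is not from any package: `ngrams_up_to` and `doc_phrase_matrix` are hypothetical helper names, and the tokenization (lowercase, split on whitespace, space as the n-gram separator) is an assumption chosen to resemble the tm output above.

```r
# Hypothetical helper: all n-grams of width 1..n for one token vector
ngrams_up_to <- function(tokens, n = 3, sep = " ") {
  M <- length(tokens)
  if (M == 0) return(character(0))
  out <- character(0)
  for (width in seq_len(min(n, M))) {
    starts <- seq_len(M - width + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + width - 1)], collapse = sep),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Hypothetical helper: count matrix with documents in rows, phrases in columns
doc_phrase_matrix <- function(texts, n = 3) {
  # Assumption: lowercase everything and split on runs of whitespace
  token_list <- strsplit(tolower(trimws(texts)), "\\s+")
  gram_list <- lapply(token_list, ngrams_up_to, n = n)
  vocab <- sort(unique(unlist(gram_list)))

  counts <- matrix(
    0L, nrow = length(texts), ncol = length(vocab),
    dimnames = list(as.character(seq_along(texts)), vocab)
  )
  for (i in seq_along(gram_list)) {
    tab <- table(gram_list[[i]])
    if (length(tab) > 0) {
      counts[i, names(tab)] <- as.integer(tab)
    }
  }
  counts
}

Text <- c("Ab Hello world", "Hello ab", "ab")
dpm <- doc_phrase_matrix(Text)
dpm[, "ab"]  # counts of the short unigram "ab" in each document
```

This builds a dense matrix, so it is only meant for small corpora; for anything large a sparse representation would be the better choice.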