
I am trying to generate a list of all unigrams through trigrams in R so that I can eventually make a document-phrase matrix whose columns include all single words, bigrams, and trigrams.

I expected to find an easy package for this, but have not succeeded. I did end up getting pointed to RWeka (code and output below), but unfortunately that approach drops all unigrams of one or two characters.

Can this be repaired, or do folks know of another road? Thanks!

library(tm)     # Corpus, VectorSource, TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 1, max = 3))
Text <- c("Ab Hello world", "Hello ab", "ab")
tt <- Corpus(VectorSource(Text))
tdm <- TermDocumentMatrix(tt,
                          control = list(tokenize = TrigramTokenizer))
inspect(tdm)
# <<TermDocumentMatrix (terms: 6, documents: 3)>>
# Non-/sparse entries: 7/11
# Sparsity           : 61%
# Maximal term length: 14
# Weighting          : term frequency (tf)

#                 Docs
# Terms            1 2 3
#   ab hello       1 0 0
#   ab hello world 1 0 0
#   hello          1 1 0
#   hello ab       0 1 0
#   hello world    1 0 0
#   world          1 0 0

Here is a version of ngram() from the answer below, edited for efficiency (I think). Basically I tried to reuse the strings of tokens from each pass, extending them by one token at a time, to avoid the double loop when include.all=TRUE.

ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {
    M <- length(tokens)

    stopifnot(n > 0)

    # nothing to return: empty doc, or doc shorter than n when include.all=FALSE
    if ((M == 0) || (!include.all && M < n)) {
        return(c())
    }

    # bail if we just want the original tokens, or if we only have one token
    if ((n == 1) || (M == 1)) {
        return(tokens)
    }

    # cap the ngram size at the number of available tokens
    end <- min(M - 1, n - 1)

    all_ngrams <- c()
    toks <- tokens
    for (width in 1:end) {
        if (include.all) {
            all_ngrams <- c(all_ngrams, toks)
        }
        # extend each width-gram by one token to get the (width+1)-grams
        toks <- paste(toks[1:(M - width)], tokens[(1 + width):M], sep = concatenator)
    }
    all_ngrams <- c(all_ngrams, toks)

    all_ngrams
}

ngram( c("A","B","C","D"), n=3, include.all=TRUE ) 
ngram( c("A","B","C","D"), n=3, include.all=FALSE ) 

ngram( c("A","B","C","D"), n=10, include.all=FALSE ) 
ngram( c("A","B","C","D"), n=10, include.all=TRUE ) 


# edge cases
ngram( c(), n=3, include.all=TRUE ) 
ngram( "A", n=0, include.all=TRUE ) 
ngram( "A", n=3, include.all=TRUE ) 
ngram( "A", n=3, include.all=FALSE ) 
ngram( "A", n=1, include.all=FALSE ) 
ngram( "A", n=1, include.all=TRUE ) 
ngram( c("A","B"), n=1, include.all=FALSE ) 
ngram( c("A","B"), n=1, include.all=TRUE ) 
ngram( c("A","B","C"), n=1, include.all=FALSE ) 
ngram( c("A","B","C"), n=1, include.all=TRUE ) 
miratrix

2 Answers


You're in luck, there is a package for this: quanteda.

install.packages("quanteda")
# or: devtools::install_github("kbenoit/quanteda")
require(quanteda)

Text <- c("Ab Hello world", "Hello ab", "ab")

dfm(Text, ngrams = 1:3, verbose = FALSE)
## Document-feature matrix of: 3 documents, 7 features.
## 3 x 7 sparse Matrix of class "dfmSparse"
##        features
## docs    ab ab_hello ab_hello_world hello hello_ab hello_world world
##   text1  1        1              1     1        0           1     1
##   text2  1        0              0     1        1           0     0
##   text3  1        0              0     0        0           0     0

This creates a document-feature matrix, where the "features" are lower-cased unigrams, bigrams, and trigrams. If you prefer spaces between the words, just add the argument concatenator = " " to the dfm() call.
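
For instance, rerunning the same call with that argument (a minimal sketch based on the call above) gives features like "hello world" rather than "hello_world":

# Same call as above, but join the words of each n-gram with a space
dfm(Text, ngrams = 1:3, concatenator = " ", verbose = FALSE)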

Problem solved, no need for Weka.

For the curious, here is the workhorse function that creates the n-grams, where tokens is a character vector (from a separate tokenizer):

ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {

    # start with lower ngrams, or just the specified size if include.all = FALSE
    start <- ifelse(include.all, 
                    1, 
                    ifelse(length(tokens) < n, 1, n))

    # set max size of ngram at max length of tokens
    end <- ifelse(length(tokens) < n, length(tokens), n)

    all_ngrams <- c()
    # outer loop for all ngrams down to 1
    for (width in start:end) {
        new_ngrams <- tokens[1:(length(tokens) - width + 1)]
        # inner loop for ngrams of width > 1
        if (width > 1) {
            for (i in 1:(width - 1)) 
                new_ngrams <- paste(new_ngrams, 
                                    tokens[(i + 1):(length(tokens) - width + 1 + i)], 
                                    sep = concatenator)
        }
        # paste onto previous results and continue
        all_ngrams <- c(all_ngrams, new_ngrams)
    }

    all_ngrams
}
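
As a quick illustration (my example, not part of the original answer), running it on the lower-cased tokens of the first document returns the unigrams, then the bigrams, then the trigram:

ngram(c("ab", "hello", "world"), n = 3, include.all = TRUE)
## [1] "ab" "hello" "world" "ab_hello" "hello_world" "ab_hello_world"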
Ken Benoit
  • Just poked at your ngram method. I think it repeats some work. I am posting an updated method as an edit to my original question---take a look if you like. (It is too long to fit as a comment) – miratrix Jul 09 '15 at 17:48

Oops. It turns out there are options you can pass via control to fix this. TermDocumentMatrix() calls termFreq(), and you can pass options through to it, such as which tokenizer to use (as above) and what clean-up to do. The culprit here is termFreq()'s default of wordLengths = c(3, Inf), which silently drops any term shorter than three characters.

So this tweaked version works:

TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 1, max = 3))
Text <- c("Ab Hello world", "Hello ab", "ab")
tt <- Corpus(VectorSource(Text))
tdm <- TermDocumentMatrix(tt,
                          control = list(wordLengths = c(1, Inf),
                                         tokenize = TrigramTokenizer))
inspect(tdm)

giving

<<TermDocumentMatrix (terms: 7, documents: 3)>>
Non-/sparse entries: 10/11
Sparsity           : 52%
Maximal term length: 14
Weighting          : term frequency (tf)

                Docs
Terms            1 2 3
  ab             1 1 1
  ab hello       1 0 0
  ab hello world 1 0 0
  hello          1 1 0
  hello ab       0 1 0
  hello world    1 0 0
  world          1 0 0
miratrix