
I was trying to find the frequency of character patterns (word parts) in a large data set.

For example, I have a list of the following in a CSV file:

  • applestrawberrylime
  • applegrapelime
  • pineapplemangoguava
  • kiwiguava
  • grapeapple
  • mixedberry
  • kiwiguavapineapple
  • limemixedberry

Is there a way to find the frequency of all character combinations? Like:

  • appleberry
  • guava
  • applestrawberry
  • kiwiguava
  • grapeapple
  • straw
  • app
  • ap
  • wig
  • mem
  • go

Update: This is what I have for finding the frequency of all character patterns of length three in my data:

threecombo <- do.call(paste0, expand.grid(rep(list(letters), 3)))

threecompare <- sapply(threecombo, function(x) length(grep(x, myData)))

The code works the way I want it to, and I would like to repeat the above steps for longer character lengths (4, 5, 6, etc.), but it takes a while to run. Is there a better way of doing this?

  • Welcome to Stack Overflow! Your question is interesting but very hard to answer. This site really works better when there is a clear question. In your case you might want to provide a link to a corpus of words, then show some code that you have tried to use, and then the problems that you have with that code. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for some tips! – Andy Clifton Nov 20 '15 at 00:38
  • Thanks, I updated my question with the progress I've made with my code so far. –  Nov 23 '15 at 22:59

2 Answers


Since you are probably looking for combinations of fruit flavors from a set of text that includes non-fruit words, I've made up some documents similar to those in your example. I've used the quanteda package to construct a document-term matrix and then filter based on ngrams containing the fruit words.

docs <- c("One flavor is apple strawberry lime.", 
          "Another flavor is apple grape lime.", 
          "Pineapple mango guava is our newest flavor.",
          "There is also kiwi guava and grape apple.", 
          "Mixed berry was introduced last year.", 
          "Did you like kiwi guava pineapple?",
          "Try the lime mixed berry.")
flavorwords <- c("apple", "guava", "berry", "kiwi", "grape")

require(quanteda)
# form a document-feature matrix ignoring common stopwords + "like"
# for ngrams, bigrams, trigrams
fruitDfm <- dfm(docs, ngrams = 1:3, ignoredFeatures = c("like", "also", stopwords("english")))
## Creating a dfm from a character vector ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 7 documents
##    ... indexing features: 90 feature types
##    ... removed 47 features, from 176 supplied (glob) feature types
##    ... complete. 
##    ... created a 7 x 40 sparse dfm
## Elapsed time: 0.01 seconds.
# select only those features containing flavorwords as regular expression
fruitDfm <- selectFeatures(fruitDfm, flavorwords, valuetype = "regex")
## kept 22 features, from 5 supplied (regex) feature types
# show the features
topfeatures(fruitDfm, nfeature(fruitDfm))
##                apple                 guava                 grape             pineapple                  kiwi 
##                    3                     3                     2                     2                     2 
##           kiwi_guava                 berry           mixed_berry            strawberry      apple_strawberry 
##                    2                     2                     2                     1                     1 
##      strawberry_lime apple_strawberry_lime           apple_grape            grape_lime      apple_grape_lime 
##                    1                     1                     1                     1                     1 
##      pineapple_mango           mango_guava pineapple_mango_guava           grape_apple       guava_pineapple 
##                    1                     1                     1                     1                     1 
## kiwi_guava_pineapple      lime_mixed_berry 
##                    1                     1 

Added:

If you are looking to match terms that are not separated by spaces in the document, you can form the ngrams with an empty-string concatenator and match as below.

flavorwordsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
                       "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
                       "limemixedberry")

fruitDfm <- dfm(docs, ngrams = 1:3, concatenator = "")
fruitDfm <- fruitDfm[, features(fruitDfm) %in% flavorwordsConcat]
fruitDfm
# Document-feature matrix of: 7 documents, 8 features.
# 7 x 8 sparse Matrix of class "dfmSparse"
#        features
# docs  applestrawberrylime applegrapelime pineapplemangoguava kiwiguava grapeapple mixedberry kiwiguavapineapple limemixedberry
# text1                   1              0                   0         0          0          0                  0              0
# text2                   0              1                   0         0          0          0                  0              0
# text3                   0              0                   1         0          0          0                  0              0
# text4                   0              0                   0         1          1          0                  0              0
# text5                   0              0                   0         0          0          1                  0              0
# text6                   0              0                   0         1          0          0                  1              0
# text7                   0              0                   0         0          0          1                  0              1

If your text contains the concatenated flavor words, then you can match the dfm features against concatenated permutations of the individual fruit words using

unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")
head(unlist(combinat::permn(unigramFlavorWords, paste, collapse = "")))
[1] "appleguavagrapepineapplekiwi" "appleguavagrapekiwipineapple" "appleguavakiwigrapepineapple" 
[4] "applekiwiguavagrapepineapple" "kiwiappleguavagrapepineapple" "kiwiappleguavapineapplegrape"
Ken Benoit
  • I like your answer, but how would it work if the items stored in docs weren't separated by spaces? docs <- c("applestrawberrylime", "kiwipineapple", "grapelemon") Also, what if you didn't know all the possible flavorwords? –  Nov 23 '15 at 16:30
  • Not sure exactly what you are asking, but I tried to cover both possibilities in an updated answer. – Ken Benoit Nov 23 '15 at 20:18

Your initial question was a simple task for grep / grepl, and I see you've incorporated this part of my answer into your revised question.

docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava',
          'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple',
          'limemixedberry')

patterns <-  c('appleberry', 'guava', 'applestrawberry', 'kiwiguava', 
               'grapeapple', 'grape', 'app', 'ap', 'wig', 'mem', 'go')

# how often does each pattern occur in the set of docs?
sapply(patterns, function(x) sum(grepl(x, docs)))

If you want to check for every possible pattern, you can search for every combination of letters (as you begin doing above), but that's obviously the long way around.
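
For what it's worth, here is a minimal sketch of that brute-force route, generalized to an arbitrary pattern length k (the function name is illustrative; the candidate set grows as 26^k, so it is only practical for small k):

# enumerate every length-k string of lowercase letters and count how many
# docs contain each one; 26^k candidates, so this explodes as k grows
brute_count <- function(k, docs) {
  combos <- do.call(paste0, expand.grid(rep(list(letters), k)))
  sapply(combos, function(p) sum(grepl(p, docs, fixed = TRUE)))
}
# head(sort(brute_count(3, docs), decreasing = TRUE))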

One strategy is to count the frequency only of the patterns that actually occur. Each document of character length n has 1 possible pattern of length n, 2 patterns of length n - 1, and so on. You can extract each of these, then count them up.

all_patterns <- lapply(docs, function(x) {

    # individual chars in this doc
    chars <- unlist(strsplit(x, ''))

    # unique possible sequence lengths
    seqs <- sapply(1:nchar(x), seq)

    # each sequence in each position
    sapply(seqs, function(y) {
      start_pos <- 0:(nchar(x) - max(y))
      sapply(start_pos, function(z) paste(chars[z + y], collapse=''))
    })
})

unq_patterns <- unique(unlist(all_patterns))

# how often does each unique pattern occur in the set of docs?
occur <- sapply(unq_patterns, function(x) sum(grepl(x, docs)))

# top 25 most frequent patterns
sort(occur, decreasing = T)[1:25]     

# e     i     a     l     p     r     m    ap    pp    pl    le   app   ppl   
# 7     7     6     6     5     5     5     5     5     5     5     5     5
# ple  appl  pple apple     g     w     b     y  ra    be    er    rr 
#   5     5     5     5     5     3     3     3   3     3     3     3 

This works and runs quickly, but as the corpus of docs grows, you may bog down (even in this simple example there are 625 unique patterns). One could parallelize all the sapply/lapply calls, but still...
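
For instance, a rough sketch of parallelizing that counting step with the parallel package (this assumes a Unix-alike, since mclapply relies on forking; the core count is illustrative):

library(parallel)
# count each unique pattern across docs on forked workers;
# fixed = TRUE treats the patterns as literal strings rather than regexes
occur_par <- unlist(mclapply(unq_patterns,
                             function(x) sum(grepl(x, docs, fixed = TRUE)),
                             mc.cores = max(1L, detectCores() - 1L)))
names(occur_par) <- unq_patterns
sort(occur_par, decreasing = TRUE)[1:25]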

arvi1000
  • This is close, but I'm looking for something that will help me when I don't know all possible patterns/combinations. –  Nov 23 '15 at 17:07
  • What have you tried? Think about what contents of 'patterns' you could supply to get the desired result. Also, consider that for a large corpus of text, "all possible char combinations" is going to be a pretty huge set. – arvi1000 Nov 23 '15 at 19:37
  • I've just updated my question with what I have so far. I realize the data set is huge, but I don't know what all the possible patterns are. I'm trying to discover what the most frequent patterns are. –  Nov 23 '15 at 23:04