
I am familiar with using the tm library to create a term-document matrix (TDM) and count the frequencies of terms.

But these terms are all single-word.

How can I count the number of times a multi-word phrase occurs in a document and/or corpus?

EDIT:

I am adding the code I have now to improve/clarify my post.

This is pretty standard code to build a term-document matrix:

library(tm)

cname <- "C:/Users/George/Google Drive/R Templates/Gospels corpus"
corpus <- Corpus(DirSource(cname))

# Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a", "the", "an", "that", "and"))

# Convert back to plain text documents
corpus <- tm_map(corpus, PlainTextDocument)

# Create a term-document matrix
tdm1 <- TermDocumentMatrix(corpus)

# Term frequencies across the corpus (top 100)
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing = TRUE)
word.freq <- word.freq[1:100]

The problem is that this returns a matrix of single-word terms, for example:

  all      into      have      from      were       one      came       say       out 
  397       390       385       383       350       348       345       332       321

I want to be able to search for multi-word terms in the corpus instead. So for example "came from" instead of just "came" and "from" separately.
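
To make the goal concrete, what I am imagining is something like a two-word tokenizer plugged into TermDocumentMatrix. This is only a sketch of the idea (based on my reading of the tm FAQ, using the NLP package that tm loads), not code I have working:

library(tm)
library(NLP)

# sketch: a tokenizer that returns two-word terms instead of single words
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))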

Thank you.

  • Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Apr 19 '17 at 13:04

3 Answers


Given the text:

text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."

To find the frequency of individual words:

table(strsplit(text, ' '))


   -      (and       and     count   example frequency         I        is    little        my 
    3         1         2         2         2         2         2         3         2         3 
   of      of).   patter.   pattern         R      some      text       the      This        to 
    2         1         1         1         2         2         2         2         2         2 
 want 
    2 

To count how many times a pattern occurs (here the word "is", matched as a whole word):

length(gregexpr('\\bis\\b', text)[[1]])

[1] 3
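
Extending the same idea to a multi-word phrase, a sketch (the `sum(... > 0)` form returns 0 when the phrase does not occur, since gregexpr reports -1 in that case):

# count occurrences of a multi-word phrase in the raw text
phrase <- "text example"
sum(gregexpr(phrase, text, fixed = TRUE)[[1]] > 0)

[1] 2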
gonzalez.ivan90

I created the following function for obtaining word n-grams and their corresponding frequencies:

library(tau)
library(data.table)

# given a string vector and the size of the ngrams, this function returns
# word ngrams with their corresponding frequencies
createNgram <- function(stringVector, ngramSize){

  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)

  if (ngramSize == 1) {
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  } else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}

Given a string like

text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."

Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the second argument:

res <- createNgram(text, 2)

Printing `res` outputs:

           w1w2      freq   length
 1:        I want    2      6
 2:        R text    2      6
 3:       This is    2      7
 4:         and I    2      5
 5:        and is    1      6
 6:     count the    2      9
 7:   example and    2     11
 8:  frequency of    2     12
 9:         is my    3      5
10:      little R    2      8
11:     my little    2      9
12:         my of    1      5
13:       of This    1      7
14:       of some    2      7
15:   pattern and    1     11
16:   some patter    1     11
17:  some pattern    1     12
18:  text example    2     12
19: the frequency    2     13
20:      to count    2      8
21:       want to    2      7
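
As a sketch of tying this back to the corpus in the question (assuming `corpus` is the cleaned tm corpus built there), you could pull the raw text out of each document, run createNgram over it, and then look up a phrase:

library(tau)
library(data.table)

# extract the plain text of each document in the tm corpus
docs <- vapply(seq_along(corpus),
               function(i) paste(as.character(corpus[[i]]), collapse = " "),
               character(1))

# bigram counts over the whole corpus
corpusBigrams <- createNgram(docs, 2)

# frequency of one specific phrase, e.g. "came from"
corpusBigrams[w1w2 == "came from"]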
Imran Ali
  • you might also find the [tokenizers package](https://cran.r-project.org/web/packages/tokenizers/index.html) useful. See `tokenize_ngrams()` in the package documentation. – Imran Ali Apr 19 '17 at 13:33
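
For completeness, a minimal sketch of that tokenizers route, using the `text` string from this answer:

library(tokenizers)

# split the text into bigrams and tabulate them
bigrams <- tokenize_ngrams(text, n = 2)[[1]]
head(sort(table(bigrams), decreasing = TRUE))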

Here is a nice example with code using tidytext: https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf

The same technique can be extended to larger n values.
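
For reference, here is a sketch of how the `bigrams` input used below might be built (the `headlines` data frame, with `year` and `text` columns, is hypothetical):

library(dplyr)
library(tidytext)

# hypothetical input: one row per headline, with `year` and `text` columns
bigrams <- headlines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)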

library(ggplot2)

# `bigrams` as built above: one row per bigram occurrence, with `year` and `bigram` columns
bigram_tf_idf <- bigrams %>%
  count(year, bigram) %>%
  filter(n > 2) %>%
  bind_tf_idf(bigram, year, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf.plot <- bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  filter(tf_idf > 0) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram))))

bigram_tf_idf.plot %>% 
  group_by(year) %>% 
  top_n(10) %>% 
  ungroup %>%
  ggplot(aes(bigram, tf_idf, fill = year)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~year, ncol = 3, scales = "free") +
  theme(text = element_text(size = 10)) +
  coord_flip()
Rohit