5

How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".

The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.

Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.

library(qdap)
library(tm)
library(stringr) # str_replace_all and str_extract_all below come from stringr

# from the crude data set, create a text file from the first three documents, then clean it

text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") # remove double spaces
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10) 

# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")

# match frequent terms that are followed by a frequent term
library(stringr)
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")

Here is where the effort falters.

Not knowing Java or Python, the similar questions "Java count word pairs" and "Python count word pairs" did not help me, but they may be useful references for others.

Thank you.

lawyeR
  • This shows how to make a term-document matrix with ngrams; then you can use `rowSums` to get occurrences and choose the frequent ones: http://stackoverflow.com/questions/28033034/r-and-tm-package-create-a-term-document-matrix-with-a-dictionary-of-one-or-two – Tyler Rinker Jun 14 '15 at 15:28

2 Answers

3

First, modify your initial text list from:

text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])

to:

text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])

Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):

text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") 
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

Build a new Corpus:

v <- Corpus(VectorSource(text))

Create a bigram tokenizer function:

BigramTokenizer <- function(x) { 
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "), 
    use.names = FALSE
  ) 
}
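
As a quick sanity check (not part of the original answer), you can see what the ngrams() helper from NLP (which tm attaches) produces on a small token vector; the tokenizer simply pastes each pair back into a single string:

# illustrative only: ngrams() returns a list of n-grams
unlist(lapply(ngrams(c("prices", "of", "crude", "oil"), 2), paste, collapse = " "))
# [1] "prices of" "of crude"  "crude oil"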

Create your TermDocumentMatrix using the control parameter tokenize:

tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
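
If you only need the bigrams that occur often, regardless of the per-document breakdown, tm's findFreqTerms() is a quick shortcut (not part of the original answer; the threshold of 2 here is arbitrary):

findFreqTerms(tdm, lowfreq = 2) # bigrams appearing at least twice across the corpus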

Now that you have your new tdm, to get your desired output, you could do:

library(dplyr)
data.frame(inspect(tdm)) %>% 
  add_rownames() %>% 
  mutate(total = rowSums(.[,-1])) %>% 
  arrange(desc(total))

Which gives:

#Source: local data frame [272 x 5]
#
#             rowname X1 X2 X3 total
#1          crude oil  2  0  1     3
#2            mln bpd  0  3  0     3
#3         oil prices  0  3  0     3
#4       cut contract  2  0  0     2
#5        demand opec  0  2  0     2
#6        dlrs barrel  2  0  0     2
#7    effective today  1  0  1     2
#8  emergency meeting  0  2  0     2
#9      oil companies  1  1  0     2
#10      oil industry  0  2  0     2
#..               ... .. .. ..   ...
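
If you want the result in exactly the Pairs/Count shape asked for, a base R equivalent (a sketch, not part of the original answer) would be:

m <- as.matrix(tdm)                           # dense term-document matrix (terms are bigrams)
counts <- sort(rowSums(m), decreasing = TRUE) # total occurrences of each bigram across documents
head(data.frame(Pairs = names(counts), Count = unname(counts)), 10)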
Steven Beaupré
  • A good answer, but @agstudy's result in a data frame is easier to deal with. More than that, I am hoping for a count of bigrams across the entire corpus, not just per document. Please see my comment to the other answer. – lawyeR Jun 15 '15 at 11:05
  • @lawyeR my result is in a data frame as well. You also have the count of bigrams across the entire corpus in the `total` column. – Steven Beaupré Jun 15 '15 at 12:11
  • @lawyeR I would appreciate your feedback as I still do not understand your comment and what's wrong with the solution I provided. – Steven Beaupré Jun 17 '15 at 20:56
  • I didn't imply your answer was wrong; I said it was good. I ran both and preferred how the other answer came out. Nothing deep; nothing malicious; just can't share checkmarks on two answers and have to pick. I did upvote yours, for what it's worth. – lawyeR Jun 17 '15 at 21:39
1

One idea here is to create a new corpus of bigrams:

A bigram or digram is a sequence of two adjacent elements in a string of tokens

A recursive function to extract bigrams:

bigram <- function(xs) {
  if (length(xs) >= 2)
    c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
}
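
For example (an illustrative check, not part of the original answer):

bigram(c("crude", "oil", "market"))
# [1] "crude_oil"  "oil_market"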

Then apply this to the crude data from the tm package. (I did some text cleaning here, but these steps depend on the text.)

res <- unlist(lapply(crude,function(x){

  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub('  +','',x)
  ## after cleaning, compute each bigram's frequency with table
  freqs <- table(bigram(strsplit(x," ")[[1]]))
  freqs[freqs>1]
}))


Because unlist() keeps the list names, each bigram is prefixed by the document it came from, so the counts are per document. Sorting that result shows the most frequent per-document bigrams:

 as.data.frame(tail(sort(res),5))
                          tail(sort(res), 5)
reut-00022.xml.hold_a                      3
reut-00022.xml.in_the                      3
reut-00011.xml.of_the                      4
reut-00022.xml.a_futures                   4
reut-00010.xml.abdul_aziz                  5

The bigrams "abdul aziz" and "a futures" are the most common. You should re-clean the data to remove stop words such as "of" and "the", but this should be a good start.

Edit after OP comments:

In case you want bigram frequencies over the whole corpus, one idea is to compute the bigrams inside the loop and then compute the frequencies on the combined result. I also take the opportunity to add some better text cleaning.

res <- unlist(lapply(crude,function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words=c("the","of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub('  +','',x)
  ## after cleaning, extract the bigrams (frequencies are computed over the whole corpus below)
  words <- strsplit(x," ")[[1]]
  bigrams <- bigram(words[nchar(words)>2])
}))

library(data.table) # setDT comes from data.table
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]


#                 res Freq
#    1: abdulaziz_bin    1
#    2:  ability_hold    1
#    3:  ability_keep    1
#    4:  ability_sell    1
#    5:    able_hedge    1
# ---                   
# 2177:    last_month    6
# 2178:     crude_oil    7
# 2179:  oil_minister    7
# 2180:     world_oil    7
# 2181:    oil_prices   14
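
If you would rather not load data.table just for the ordering, the same ranking can be done in base R (a sketch):

xx[order(xx$Freq), ]             # ascending, matching the output above
head(xx[order(-xx$Freq), ], 10)  # or just the ten most frequent bigrams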
agstudy
  • A good answer, @agstudy. My results show some bigrams multiple times (e.g., "case_team", perhaps because your solution looks within each document and counts its bigrams? Is there a tweak to show the frequency of bigrams across the entire corpus, which is really what I want? Or should I post a follow-up question? > tail(sort(res), 5) case_team privilege_review additional_search search_terms case_team 3 3 4 4 4 – lawyeR Jun 15 '15 at 11:03