2

I have a vector of sentences, say:

x = c("I like donut", "I like pizza", "I like donut and pizza")

I want to count how often each combination of two words occurs. The ideal output is a data frame with three columns (word1, word2 and frequency), something like this:

 I      like    3
 I      donut   2
 I      pizza   2
 like   donut   2
 like   pizza   2
 donut  pizza   1
 donut  and     1
 pizza  and     1

In the first row of the output, the frequency is 3 because "I" and "like" occur together three times: in x[1], x[2] and x[3].

Any advice is appreciated :)

JasonMArcher
nurandi
  • Did you use google or the search bar before posting this question? Try [this](http://stackoverflow.com/questions/11403196/r-count-times-word-appears-in-element-of-list) or [this](http://stackoverflow.com/questions/18864612/frequency-of-occurrence-of-two-pair-combinations-in-text-data-in-r) or [any of these](http://stackoverflow.com/search?q=R+word+combinations). – Oliver Keyes Dec 20 '14 at 00:58
  • What about `I I` and `like like`, etc? Presumably you want only those combinations of *different* words? `gtools::permutations` might be useful for you here – Rich Scriven Dec 20 '14 at 01:22
  • @OliverKeyes : yes, of course. – nurandi Dec 20 '14 at 01:35
  • @RichardScriven. Yes, I want only combination of different words. Thank you for your suggestion, I will try with `gtools` :) – nurandi Dec 20 '14 at 01:37

2 Answers

6

Split each sentence into words, sort the words so that each pair is always written in the same order, get all pairs with `combn`, paste each pair into a single space-separated string, use `table` to get the frequencies, then put it all together in a data frame.

Here's an example:

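f <- function(x) {
  # Split each sentence into words and paste every sorted pair into one string
  pr <- unlist(
    lapply(
      strsplit(x, ' '),
      function(i) combn(sort(i), 2, paste, collapse=' ')
    )
  )

  # Count how often each pair string occurs
  tbl <- table(pr)

  # Split the pair strings back into two columns
  d <- do.call(rbind.data.frame, strsplit(names(tbl), ' '))
  names(d) <- c('word1', 'word2')
  d$Freq <- as.vector(tbl)  # plain integer column instead of a table object

  d
}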

With your example data:

> f(x)
   word1 word2 Freq
1    and donut    1
2    and     I    1
3    and  like    1
4    and pizza    1
5  donut     I    2
6  donut  like    2
7  donut pizza    1
8      I  like    3
9      I pizza    2
10  like pizza    2
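
The function above pairs every word in a sentence with every other word, so a sentence that repeats a word would produce pairs such as "I I", and a one-word sentence would make `combn` fail. A minimal sketch of a variant that covers both cases, assuming that only pairs of different words should be counted and at most once per sentence (`pair_counts` is a hypothetical name, not part of the answer):

```r
# Hypothetical helper: count unordered pairs of distinct words that
# appear together in the same sentence.
pair_counts <- function(x) {
  # Split on runs of whitespace; keep each non-empty word once per sentence,
  # sorted so that every pair has a single orientation.
  words <- lapply(strsplit(x, "[[:space:]]+"),
                  function(w) sort(unique(w[nzchar(w)])))

  # combn() needs at least two items, so drop shorter sentences.
  words <- Filter(function(w) length(w) >= 2, words)
  if (length(words) == 0) {
    return(data.frame(word1 = character(), word2 = character(),
                      Freq = integer()))
  }

  # One row per pair, stacked over all sentences into a 2-column matrix.
  pairs <- do.call(rbind, lapply(words, function(w) t(combn(w, 2))))

  # Cross-tabulate the two columns and keep only pairs that occur.
  out <- as.data.frame(table(word1 = pairs[, 1], word2 = pairs[, 2]),
                       responseName = "Freq", stringsAsFactors = FALSE)
  out <- out[out$Freq > 0, ]
  rownames(out) <- NULL
  out
}

x <- c("I like donut", "I like pizza", "I like donut and pizza")
res <- pair_counts(x)
print(res)
```

Which word of a pair ends up in `word1` follows the sort order of the current locale, so the orientation of the rows may differ between systems while the counts stay the same.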
Matthew Lundberg
  • Great. Using `combn`, I can also count occurrence of combination of 3 or more words. Thank you :) – nurandi Dec 20 '14 at 02:28
0
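library(dplyr)  # count() is provided by dplyr, not tidyr
# DF is assumed to hold one row per word pair, in columns column1 and column2
Counts <- DF %>% 
  count(column1, column2, sort = TRUE)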
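
The snippet above presumes that a data frame of word pairs already exists. A rough sketch of how such a `DF` could be built from the question's `x`, assuming one row per pair of distinct words within a sentence; the column names follow the snippet, and base R's `aggregate` is used here only as a stand-in for the counting step so that the example runs without extra packages:

```r
x <- c("I like donut", "I like pizza", "I like donut and pizza")

# One row per within-sentence word pair; words are sorted so that
# each pair always appears in the same orientation.
pair_list <- lapply(strsplit(x, " "),
                    function(w) t(combn(sort(unique(w)), 2)))
DF <- as.data.frame(do.call(rbind, pair_list), stringsAsFactors = FALSE)
names(DF) <- c("column1", "column2")

# Base-R stand-in for counting rows per (column1, column2), most frequent first.
DF$n <- 1L
Counts <- aggregate(n ~ column1 + column2, data = DF, FUN = sum)
Counts <- Counts[order(-Counts$n), ]
print(Counts)
```

With a `DF` of this shape (without the helper column `n`), the dplyr pipeline should give the same counts.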
Brad