5

I have an R data frame with hundreds of rows, like this:

word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1

I would like to group the data by pattern, so that, for example, seed and seeds are combined. The result would look like:

word     Freq
seed      7
contract  4
river     1
Samuel Shamiri
  • I'm not sure there is one post on SO that does everything you want, but there are several you could put together to solve your problem. For example, check out this post on [pattern-matching](http://stackoverflow.com/questions/20219311/pattern-matching-and-replacement-in-r) and this post on [the summary function](http://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group). Also, including a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would improve your post. As written, your question is broad. – Richard Erickson Oct 26 '15 at 02:27

4 Answers

3

Here is potentially another way to go. The SnowballC package has a function, wordStem(), that cleans up words and returns their stems. Using that, you can skip the string manipulation entirely, I think. Once the words are stemmed, all you need to do is sum the frequencies by stem.
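To see what wordStem() does on its own, here is a quick sketch using the question's words:

```r
library(SnowballC)  # provides wordStem()

# Porter stemming collapses inflected forms to a common stem
wordStem(c("seed", "seeds", "contract", "contracting", "river"))
# "seed" "seed" "contract" "contract" "river"
```

With the inflected forms mapped to the same stem, a grouped sum does the rest, as below.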

library(SnowballC)
library(dplyr)

mydf <- read.table(text = "word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1", header = T)

mutate(mydf, word = wordStem(word)) %>%
group_by(word) %>%
summarise(total = sum(Freq))

#      word total
#     (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7
jazzurro
2

One option would be to create a grouping variable 'gr' by extracting the substring based on the minimum number of characters in 'word', then do the same once more with 'word' so that we get the substring for each group of words, and finally get the sum of 'Freq' by 'word'.

library(dplyr)
df1 %>% 
    group_by(gr = substr(word, 1, min(nchar(word)))) %>%
    group_by(word = substr(word, 1, min(nchar(word)))) %>%
    summarise(Freq = sum(Freq)) 
#      word  Freq
#      (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7
akrun
1

This can also be done with a cross-join, which is a little safer than the substring method above.

library(dplyr)
library(stringi)

df %>%
  merge(df %>% select(short_word = word) ) %>%
  filter(short_word %>%
           stri_detect_regex(word, .) ) %>%
  group_by(word) %>%
  slice(short_word %>% stri_length %>% which.min) %>%
  group_by(short_word) %>%
  summarise(Freq= sum(Freq)) 
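
For reference, here is a self-contained sketch of the same pipeline, where df is assumed to be the question's data frame and result is just a name for the output:

```r
library(dplyr)
library(stringi)

# df is the question's data frame
df <- data.frame(word = c("seed", "seeds", "contract", "contracting", "river"),
                 Freq = c(4, 3, 2, 2, 1),
                 stringsAsFactors = FALSE)

result <- df %>%
  merge(df %>% select(short_word = word)) %>%            # cross-join: every word paired with every word
  filter(short_word %>%
           stri_detect_regex(word, .)) %>%               # keep pairs where short_word occurs inside word
  group_by(word) %>%
  slice(short_word %>% stri_length %>% which.min) %>%    # the shortest matching word acts as the stem
  group_by(short_word) %>%
  summarise(Freq = sum(Freq))

result
# totals: contract = 4, river = 1, seed = 7
```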
bramtayl
1

An attempt using adist to match the terms up.

# start with each word in its own group
dat$grp <- seq(nrow(dat))

# generate a matrix comparing the vector of words to themselves
tmp <- adist(dat$word, dat$word, partial=TRUE)
diag(tmp) <- Inf
# a zero means one word matches inside another - merge those groups
dat$grp[col(tmp)[tmp==0]] <- row(tmp)[tmp==0]

# sum the frequencies within each group, then label each
# group with its representative word
final <- aggregate(Freq ~ grp, data=dat, sum)
final$word <- dat$word[match(final$grp, dat$grp)]

#  grp Freq     word
#1   1    7     seed
#2   3    4 contract
#3   5    1    river

Data used:

dat <- data.frame(word=c("seed","seeds","contract","contracting","river"),Freq=c(4,3,2,2,1))
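
The key here is partial=TRUE, which makes the distance 0 when the first string occurs inside the second, e.g.:

```r
# with partial = TRUE, the first string is treated as a pattern that
# may match a substring of the second, so "seed" inside "seeds" is 0
adist("seed", "seeds", partial = TRUE)
#      [,1]
# [1,]    0
```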
thelatemail