5

I have an R data frame with hundreds of rows, like this:

word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1

I would like to group the data by pattern, so that, for example, seed and seeds are combined. The result would look like:

word     Freq
seed      7
contract  4
river     1
Samuel Shamiri
  • I'm not sure there is one post on SO that does everything you want, but there are several you could put together to solve your problem. For example, check out this post on [pattern-matching](http://stackoverflow.com/questions/20219311/pattern-matching-and-replacement-in-r) and this post on [the summary function](http://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group). Also, including a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would improve your post. As written, your question is broad. – Richard Erickson Oct 26 '15 at 02:27

4 Answers

3

Here is potentially another way to go. The SnowballC package has a function, wordStem(), that cleans up words and returns their stems. Using that, you can skip the string manipulation entirely, I think. Once the words are stemmed, all you need to do is sum the frequencies by stem.
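To see what wordStem() does on its own, here is a quick sketch using the question's words:

```r
library(SnowballC)  # provides wordStem()

# Porter stemming collapses inflected forms to a common stem
wordStem(c("seed", "seeds", "contract", "contracting", "river"))
# "seed" "seed" "contract" "contract" "river"
```

With the inflected forms mapped to the same stem, a grouped sum does the rest, as below.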

library(SnowballC)
library(dplyr)

mydf <- read.table(text = "word        Freq
seed         4
seeds        3
contract     2
contracting  2
river        1", header = T)

mutate(mydf, word = wordStem(word)) %>%
group_by(word) %>%
summarise(total = sum(Freq))

#      word total
#     (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7
jazzurro
2

One option would be to create a grouping variable 'gr' by extracting the substring based on the minimum number of characters in 'word', then do the same once more with 'word' so that we get the substring for each group of words, and finally get the sum of 'Freq' by 'word'.

library(dplyr)
df1 %>% 
    group_by(gr = substr(word, 1, min(nchar(word)))) %>%
    group_by(word = substr(word, 1, min(nchar(word)))) %>%
    summarise(Freq = sum(Freq)) 
#      word  Freq
#      (chr) (int)
#1 contract     4
#2    river     1
#3     seed     7
akrun
1

This can also be done with a cross-join, which is a little safer than the substring method above.

library(dplyr)
library(stringi)

df %>%
  merge(df %>% select(short_word = word) ) %>%
  filter(short_word %>%
           stri_detect_regex(word, .) ) %>%
  group_by(word) %>%
  slice(short_word %>% stri_length %>% which.min) %>%
  group_by(short_word) %>%
  summarise(Freq= sum(Freq)) 
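
For reference, here is a self-contained sketch of the same pipeline, where df is assumed to be the question's data frame and result is just a name for the output:

```r
library(dplyr)
library(stringi)

# df is the question's data frame
df <- data.frame(word = c("seed", "seeds", "contract", "contracting", "river"),
                 Freq = c(4, 3, 2, 2, 1),
                 stringsAsFactors = FALSE)

result <- df %>%
  merge(df %>% select(short_word = word)) %>%            # cross-join: every word paired with every word
  filter(short_word %>%
           stri_detect_regex(word, .)) %>%               # keep pairs where short_word occurs inside word
  group_by(word) %>%
  slice(short_word %>% stri_length %>% which.min) %>%    # the shortest matching word acts as the stem
  group_by(short_word) %>%
  summarise(Freq = sum(Freq))

result
# totals: contract = 4, river = 1, seed = 7
```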
bramtayl
1

An attempt using adist to match the terms up.

# start with each word in its own group
dat$grp <- seq(nrow(dat))

# generate a matrix comparing the vector of words to themselves
tmp <- adist(dat$word, dat$word, partial=TRUE)
diag(tmp) <- Inf
# a zero means one word matches inside another - merge those groups
dat$grp[col(tmp)[tmp==0]] <- row(tmp)[tmp==0]

# sum the frequencies within each group, then label each
# group with its representative word
final <- aggregate(Freq ~ grp, data=dat, sum)
final$word <- dat$word[match(final$grp, dat$grp)]

#  grp Freq     word
#1   1    7     seed
#2   3    4 contract
#3   5    1    river

Data used:

dat <- data.frame(word=c("seed","seeds","contract","contracting","river"),Freq=c(4,3,2,2,1))
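
The key here is partial=TRUE, which makes the distance 0 when the first string occurs inside the second, e.g.:

```r
# with partial = TRUE, the first string is treated as a pattern that
# may match a substring of the second, so "seed" inside "seeds" is 0
adist("seed", "seeds", partial = TRUE)
#      [,1]
# [1,]    0
```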
thelatemail