Group by grepping for terms in a column with dplyr

Question

I have a dataframe as follows:

Symptom                                              number
Abdominal pain\n Swallowing probs\n Back issues\n    22
Abdominal pain\n                                     12
Back issues \n Vomiting \n                           14
Back issues\n                                        5

There is always a \n at the end of each symptom phrase. The symptom phrase itself can be anything, so I don't want to search for specific terms, but rather for any term before (or between) \n.

I would like to average the number for each symptom so that I end up with:

Symptom                       Avg
Abdominal pain                 17
Swallowing probs               22
Back issues                    20.5
Vomiting                       14

I don't know how to group by the individual terms with dplyr. I have tried

SypmAvg<- df %>% group_by(grepl("(?\\n.*\\n)|($.*?\\n)",df$Symptom)%>% summarise(mean=mean(number)

but it just crashes my computer so I don't even get to see the error. Can anyone help? Is it just a regex issue or is there a better way to do this?

akrun · Accepted Answer · 2017-01-06T10:38:17.603

2

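We can use cSplit from splitstackshape to split the column into long format and then average by symptom (cSplit returns a data.table, hence the [ ... ] syntax):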

library(splitstackshape)
cSplit(df, "Symptom", "\\n", "long")[, .(Avg = mean(number)), .(Symptom)]

edited Jan 06 '17 at 10:38

answered Jan 06 '17 at 10:29

akrun


Thanks. Is it possible that the Avg is only being computed for the second half of the split? My numbers don't seem to add up – Sebastian Zeki Jan 06 '17 at 11:08
Aha. I think it's because of the NAs. I guess I just have to put na.rm=T – Sebastian Zeki Jan 06 '17 at 11:32
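
The NA behaviour mentioned in the comment above can be sketched in base R. The values below are made up for illustration and are not the asker's data: `mean()` returns `NA` as soon as one value in a group is missing, unless `na.rm = TRUE` is passed.

```r
# Hypothetical values for one symptom group, with one missing entry
number <- c(22, NA, 14)

avg_default <- mean(number)                # NA: the missing value propagates
avg_na_rm   <- mean(number, na.rm = TRUE)  # NA dropped, mean of 22 and 14

print(avg_default)
print(avg_na_rm)
```

In the cSplit answer the same argument would go inside the aggregation, i.e. `mean(number, na.rm = TRUE)`.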

score 1 · Answer 2 · answered Jan 06 '17 at 15:49

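library(dplyr)
library(tidyr)  # provides unnest()

# Split each Symptom string on newlines, giving one row per symptom
# (the sample df has no id column, so no grouping is needed at this step)
df1 <- df %>% mutate(new_col = strsplit(as.character(Symptom), "\n")) %>% unnest(new_col)

# Trim stray whitespace, then average number per symptom
df1 %>% group_by(new_col = trimws(new_col)) %>% summarise(ans = mean(number))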

# new_col   ans
# 1   Abdominal pain 17.00000
# 2      Back issues 13.66667
# 3 Swallowing probs 22.00000
# 4         Vomiting 14.00000

joel.wilson


@SebastianZeki number didn't match for Back Issues... is yours correct? – joel.wilson Jan 07 '17 at 11:30
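
To check the Back issues figure by hand, here is a minimal base R sketch that rebuilds the sample data from the question and averages per symptom. It assumes the Symptom column holds real newline characters; if the data contains a literal backslash followed by n, the separator would need to be "\\n" instead. The names `parts`, `long` and `avg` are only illustrative.

```r
# Sample data from the question (assumption: "\n" is a real newline)
df <- data.frame(
  Symptom = c("Abdominal pain\n Swallowing probs\n Back issues\n",
              "Abdominal pain\n",
              "Back issues \n Vomiting \n",
              "Back issues\n"),
  number = c(22, 12, 14, 5),
  stringsAsFactors = FALSE
)

# One character vector of symptoms per row, with whitespace trimmed
parts <- lapply(strsplit(df$Symptom, "\n", fixed = TRUE), trimws)

# Repeat each row's number once per symptom found in that row
long <- data.frame(
  Symptom = unlist(parts),
  number  = rep(df$number, lengths(parts)),
  stringsAsFactors = FALSE
)

# Average per symptom
avg <- tapply(long$number, long$Symptom, mean)
print(avg)
```

Back issues appears in the rows with 22, 14 and 5, so its mean is 41/3, about 13.67. That agrees with the output in this answer rather than the 20.5 shown in the question.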
