Grouping by Multiple variables and summarizing character frequencies

Question

I am trying to group my dataset by multiple variables and build a frequency table of the number of times a character variable appears. Here is an example data set:

Location    State   County  Job         Pet
            Ohio    Miami   Data        Dog
Urban       Ohio    Miami   Business    Dog, Cat
Urban       Ohio    Miami   Data        Cat
Rural      Kentucky Clark   Data        Cat, Fish
City       Indiana  Shelby  Business    Dog

Rural Kentucky Clark Data Dog, Fish Ohio Miami Data Dog, Cat Urban Ohio Miami Business Dog, Cat Rural Kentucky Clark Data Fish City Indiana Shelby Business Cat

I want my output to look like this:

Location    State   County  Job      Frequency  Pet:Cat Pet:Dog Pet:Fish
            Ohio    Miami   Data        2         1        2       0
 Urban      Ohio    Miami   Business    2         2        2       0
 Urban      Ohio    Miami   Data        1         1        0       0
 Rural    Kentucky  Clark   Data        3         1        1       3
 City     Indiana   Shelby  Business    2         1        1       0

I have tried different iterations of the following code, and I get close, but not quite right:

Output<-df%>%group_by(Location, State, County, Job)%>%
  dplyr::summarise(
    Frequency= dplyr::n(),
    Pet:Cat = count(str_match(Pet, "Cat")),
    Pet:Dog = count(str_match(Pet, "Dog")),
    Pet:Fish = count(str_match(Pet, "Fish")),
    )

Any help would be appreciated! Thank you in advance

score 1 · Accepted Answer · answered Dec 18 '20 at 15:21

1

Try this:

library(dplyr)
library(tidyr)
#Code
new <- df %>% 
  separate_rows(Pet,sep=',') %>%
  mutate(Pet=trimws(Pet)) %>%
  group_by(Location,State,County,Job,Pet) %>%
  summarise(N=n()) %>%
  mutate(Pet=paste0('Pet:',Pet)) %>%
  group_by(Location,State,County,Job,.drop = F) %>%
  mutate(Freq=n()) %>%
  pivot_wider(names_from = Pet,values_from=N,values_fill=0)

Output:

# A tibble: 5 x 8
# Groups:   Location, State, County, Job [5]
  Location State    County Job       Freq `Pet:Cat` `Pet:Dog` `Pet:Fish`
  <chr>    <chr>    <chr>  <chr>    <int>     <int>     <int>      <int>
1 ""       Ohio     Miami  Data         2         1         2          0
2 "City"   Indiana  Shelby Business     2         1         1          0
3 "Rural"  Kentucky Clark  Data         3         1         1          3
4 "Urban"  Ohio     Miami  Business     2         2         2          0
5 "Urban"  Ohio     Miami  Data         1         1         0          0

Some data used:

#Data
df <- structure(list(Location = c("", "Urban", "Urban", "Rural", "City", 
"Rural", "", "Urban", "Rural", "City"), State = c("Ohio", "Ohio", 
"Ohio", "Kentucky", "Indiana", "Kentucky", "Ohio", "Ohio", "Kentucky", 
"Indiana"), County = c("Miami", "Miami", "Miami", "Clark", "Shelby", 
"Clark", "Miami", "Miami", "Clark", "Shelby"), Job = c("Data", 
"Business", "Data", "Data", "Business", "Data", "Data", "Business", 
"Data", "Business"), Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish", 
"Dog", "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")), row.names = c(NA, 
-10L), class = "data.frame")

answered Dec 18 '20 at 15:21

Duck

39,058
13
42
84

I am getting this error: Error: `n()` must only be used inside dplyr verbs. – JeffB Dec 18 '20 at 15:27
I am assuming the n() function is being covered by a different package? – JeffB Dec 18 '20 at 15:28
@JeffB Some conflict with other package. Suggestion, re start `R` and load only the mentioned packages and run the code using the sample data `df`. – Duck Dec 18 '20 at 15:29
This worked for my sample data, but I am getting this message for my real data set: Error: Problem with `mutate()` input `Source`. x object 'Source' not found i Input `Source` is `paste0("Source:", Source)`. i The error occurred in group 1: Position = "", Race = "", Gender = "". Source would be the equivalent to Pet in my mock data set. Thank you for your help! – JeffB Dec 18 '20 at 15:43
@JeffB Looks like source has missing values, could you check `unique(yourdata$Source)`? – Duck Dec 18 '20 at 15:48
This is the case. Is there away to adjust this code to ignore these cells? – JeffB Dec 18 '20 at 15:50
@JeffB Yes, you can add a filter(...) to exclude the values you want, let me know if that is clear, or invite me to chat. Glad to help. – Duck Dec 18 '20 at 15:52
Not sure where I would add this filter. I tried: new <- df %>% filter(!Pets=="")%>% separate_rows(Pet,sep=',') %>% mutate(Pet=trimws(Pet)) %>% group_by(Location,State,County,Job,Pet) %>% summarise(N=n()) %>% mutate(Pet=paste0('Pet:',Pet)) %>% group_by(Location,State,County,Job,.drop = F) %>% mutate(Freq=n()) %>% pivot_wider(names_from = Pet,values_from=N,values_fill=0) But that returned the same results. – JeffB Dec 18 '20 at 16:30
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226141/discussion-between-jeffb-and-duck). – JeffB Dec 18 '20 at 16:33

Grouping by Multiple variables and summarizing character frequencies

1 Answers1

Linked