0

I have a multi-response question from a survey.

The data look like this:

|respondent| friend          |
|----------|-----------------|
| 001      | John, Mary      |
|002       | Sue, John, Peter|

Then, I want to count, for each respondent, how many male and female friends they have. I imagine I need to create separate vectors of male and female names, then check each cell in the friend column against these vectors and count.

Any help is appreciated.

  • 1
    The biggest challenge is going to be to get an accurate list of names with their assumed male/female annotation. Do you have such a list? Not sure what you want to do for names that are used by both groups. You can split data into rows with solutions from: https://stackoverflow.com/questions/15347282/split-delimited-strings-in-a-column-and-insert-as-new-rows. But you'll need the appropriate lookup data to proceed after that. – MrFlick Feb 16 '22 at 19:37
  • Yes, I have a finite list for names for this purpose. There is also no overlap. I had a look at the link you provided. That makes sense. How do I look it up against my list. Just using the example here, assume the lookup data are: male: John, Peter female: Sue, Mary – Bronson Hui Feb 16 '22 at 19:43
  • 1
    If you have the list, just merge/join the two tables. See https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right – MrFlick Feb 16 '22 at 19:43
  • Further, if you have the list, I suggest you add it (or at a relevant subset of it) to your question, please don't make us guess at the presence or structure of it. – r2evans Feb 16 '22 at 19:46
  • thanks, @MrFlick! – Bronson Hui Feb 16 '22 at 19:54

2 Answers2

1

This should be heavily caveated, because many common names are frequently used by different genders. Here I use the genders applied in american social security data in the babynames package as a proxy. Then I merge that with my data and come up with a weighted count based on likelihood. In the dataset, fairly common names including Casey, Riley, Jessie, Jackie, Peyton, Jaime, Kerry, and Quinn are almost evenly split between genders, so in my approach those add about half a female friend and half a male friend, which seems to me the most sensible approach when the name alone doesn't add much information about gender.

library(tidyverse) # using dplyr, tidyr
gender_freq <- babynames::babynames %>%
  filter(year >= 1930) %>% # limiting to people <= 92 y.o.
  count(name, sex, wt = n) %>%
  group_by(name) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

tribble(
  ~respondent, ~friend,
  "001", "John, Mary, Riley",
  "002", "Sue, John, Peter") %>%
  separate_rows(friend, sep = ", ") %>%
  left_join(gender_freq, by = c("friend" = "name")) %>%
  count(respondent, sex, wt = share)


## A tibble: 4 x 3
#  respondent sex       n
#  <chr>      <chr> <dbl>
#1 001        F      1.53
#2 001        M      1.47
#3 002        F      1.00
#4 002        M      2.00
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
0

Assuming you have a list that links a name with gender, you can split up your friend column, merge the result with your list and summarise on the gender:

library(tidyverse)
df <- tibble(
  respondent = c('001', '002'), 
  friend = c('John, Mary', 'Sue, John, Peter')
)

names_df <- tibble(
  name = c('John', 'Mary', 'Sue','Peter'),
  gender = c('M', 'F', 'F', 'M')
)

df %>% 
  mutate(friend = strsplit(as.character(friend), ", ")) %>% 
  unnest(friend) %>% 
  left_join(names_df, by = c('friend' = 'name')) %>% 
  group_by(respondent) %>% 
  summarise(male_friends = sum(gender == 'M'), 
            female_friends = sum(gender == 'F'))

resulting in

# A tibble: 2 x 3
  respondent male_friends female_friends
* <chr>             <int>          <int>
1 001                   1              1
2 002                   2              1
tivd
  • 750
  • 3
  • 17