I have a dataset in which respondents could select multiple responses to the same question, one asking for their nationality. Most selected only one category, but some selected several (there was also a free-text option, whose entries I will report separately). I want to know how to honour the people who selected multiple responses without distorting the rest of the data.

Effectively, all I want is basic demographics (n, mean, sd, etc.), so I am okay with the summed counts of the nationality groups exceeding the number of participants (unless there is some reason this is a bad idea that I haven't thought of, in which case please say). I ran my columns through as.numeric(), which warned that some values were coerced to NA (those with multiple responses). I know how to get rid of this warning with e.g. gsub(",", ""), but not in a meaningful way that preserves these people's answers. I saw a couple of solutions to this question here, but I'm still an R beginner so I'm unsure what the best route is.

I would be interested in any solution that lets me count those who selected multiple answers to this question as their own group, as well as within their original categories. E.g. one table with English: 5, Welsh: 3, Scottish: 2, Northern Irish: 1, British: 4, Other: 0; and one table with English: 3, Welsh: 1, Scottish: 1, Northern Irish: 1, British: 2, Other: 0, Multiple selected: 2.

Dummy data is as follows:

# Mixing numbers and strings in c() stores every value as character
df <- data.frame(Nationality = c(1, "1,2,3,5", 2, "1,2,5", 1, 1, 3, 5, 5, 4))

I also later re-code the numeric values to display the choice text, as below:

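library(dplyr)

# Map each single code to its label; multi-select values such as "1,2,3,5"
# match none of the codes and are left unchanged
df <- df %>%
  mutate(Nationality = recode(Nationality, 
                            '1' = 'English', 
                            '2' = 'Welsh',
                            '3' = 'Scottish',
                            '4' = 'Northern Irish',
                            '5' = 'British',
                            '6' = 'Other'))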

Here's the code I will run it through to get demographic statistics:

df %>%
  group_by(Nationality) %>%
  summarise(n = n()) %>%
  mutate(Percentage = round(100*(n / sum(n)), 2))

I tried converting the relevant columns of my data set to numeric (including the column for nationality):

df <- df %>% mutate(across(c(1, 2, 4, 5, 7, 13:57), as.numeric))

This, as predicted, returned the warning 'NAs introduced by coercion'. I've thought about extracting the column and using the solutions in the post I linked, but haven't had any luck.
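For reference, a minimal reproduction of the warning on the dummy values, assuming the column is stored as character (which it is as soon as any entry contains a comma). The object name `converted` is just for illustration:

```r
# Dummy values from above; c() coerces the whole vector to character
Nationality <- c(1, "1,2,3,5", 2, "1,2,5", 1, 1, 3, 5, 5, 4)

# Single codes convert cleanly; comma-separated entries cannot be parsed
# as one number, so they become NA and R emits the coercion warning
converted <- as.numeric(Nationality)

# Positions of the entries that were lost, i.e. the multi-select answers
which(is.na(converted))
```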

Not posted a question before, so if I need to provide any more info please let me know. I hope I've explained it well enough to give the gist of the problem.

stefan
PsyCoder

1 Answer

We can either separate the column into long format first and then recode, or, as below, use str_replace_all() to swap the codes for their labels and then separate, before doing the group_by()/summarise().

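library(dplyr)
library(stringr)
library(tidyr)  # separate_longer_delim() needs tidyr >= 1.3.0

df %>% 
  # Swap each code for its label; this works because every code is a single digit
  mutate(Nationality = str_replace_all(Nationality, c('1' = 'English', 
                            '2' = 'Welsh',
                            '3' = 'Scottish',
                            '4' = 'Northern Irish',
                            '5' = 'British',
                            '6' = 'Other'))) %>% 
  # One row per selected nationality, so multi-select respondents count once per choice
  separate_longer_delim(Nationality, delim = ",") %>%   
  group_by(Nationality) %>%
  summarise(n = n()) %>%
  # Percentage of all selections (15 here), not of respondents (10)
  mutate(Percentage = round(100*(n / sum(n)), 2))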

-output

# A tibble: 5 × 3
  Nationality        n Percentage
  <chr>          <int>      <dbl>
1 British            4      26.7 
2 English            5      33.3 
3 Northern Irish     1       6.67
4 Scottish           2      13.3 
5 Welsh              3      20   
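
The question also asked for a second table in which respondents who ticked several boxes form their own group, and for categories with no responses (Other) to show up as 0. A minimal sketch of one way to get both, assuming the dummy data from the question. It separates first and recodes afterwards, i.e. the other route mentioned at the top. The `labels` lookup vector, the object names and the choice of denominators are illustrative assumptions, not something from the original post:

```r
library(dplyr)
library(tidyr)  # separate_longer_delim() needs tidyr >= 1.3.0

# Dummy data from the question; c() coerces every value to character
df <- data.frame(
  Nationality = c(1, "1,2,3,5", 2, "1,2,5", 1, 1, 3, 5, 5, 4),
  stringsAsFactors = FALSE
)

# Code -> label lookup (hypothetical helper, same mapping as the recode() call)
labels <- c("1" = "English", "2" = "Welsh", "3" = "Scottish",
            "4" = "Northern Irish", "5" = "British", "6" = "Other")

n_respondents <- nrow(df)

# Table 1: every selection counted, so n can sum to more than n_respondents.
# Using a factor with all levels plus .drop = FALSE keeps zero-count categories.
all_selections <- df %>%
  separate_longer_delim(Nationality, delim = ",") %>%
  mutate(Nationality = factor(unname(labels[trimws(Nationality)]),
                              levels = unname(labels))) %>%
  count(Nationality, .drop = FALSE) %>%
  # Assumption: percentage of respondents rather than of selections
  mutate(Percentage = round(100 * n / n_respondents, 2))

# Table 2: anyone with a comma in their answer goes into "Multiple selected"
exclusive <- df %>%
  mutate(
    Nationality = if_else(grepl(",", Nationality, fixed = TRUE),
                          "Multiple selected",
                          unname(labels[trimws(Nationality)])),
    Nationality = factor(Nationality,
                         levels = c(unname(labels), "Multiple selected"))
  ) %>%
  count(Nationality, .drop = FALSE) %>%
  mutate(Percentage = round(100 * n / sum(n), 2))

all_selections
exclusive
```

In the first table the percentages can add up to more than 100, because they are taken over respondents rather than selections; swap `n_respondents` for `sum(n)` to get the share of selections instead. In the second table each respondent appears exactly once, so the counts sum to the number of respondents.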
akrun
  • Thank you, this worked! Had to solve a lot of issues with my updates/packages to get it working, but I got it and ended up solving some other problems I was having with R. – PsyCoder Mar 31 '23 at 11:38