I've imported survey data 'df' into R and would like to split a character variable 'symptom' into a set of binary variables plus one additional character variable, according to its stored responses. The 'symptom' variable records all responses to the multiple-choice question 'What symptoms are you experiencing?'. Respondents ticked the box(es) that best describe their symptoms, and the corresponding options are stored in 'symptom' as a single string.

Q: What symptoms are you experiencing?

  • Quickly fall into sleep, but wake up shortly
  • Feel emotionally, physically weak
  • Sleep paralysis. i.e., wide awake but can't move your body
  • Lose weight quickly, lack of appetite
  • Other, ___

Here is a reproducible data frame

df = data.frame(
  id = c(1,2,3,4),
  symptom = c("Quickly fall into sleep, but wake up shortly, Feel emotionally, physically weak, Sleep paralysis. i.e., wide awake but can't move your body","Feel emotionally, physically weak, Lose weight quickly, lack of appetite","Sleep paralysis. i.e., wide awake but can't move your body, Other, increased dreaming","Sleep paralysis. i.e., wide awake but can't move your body"))

For example, Mike ticked options 1, 2 and 3, so his value in 'symptom' is 'Quickly fall into sleep, but wake up shortly, Feel emotionally, physically weak, Sleep paralysis. i.e., wide awake but can't move your body'. The ticked options are separated by commas. If someone ticked the fifth box, they had to write their other symptoms in the blank, and that text is stored in 'symptom' too, e.g. 'Lose weight quickly, lack of appetite, Other,increased dreaming'.

I have tried lapply(), gsub() and grepl(), but none of them worked.

lapply(adult$narco_cause1, gsub, pattern="Quickly fall into sleep, but wake up shortly", replacement=1)

I expect to create 5 binary variables denoting which symptoms each respondent has (1 = yes, 0 = no). For those who chose the 'Other,' option, an additional character variable should record the uncategorised free text as a string.

Thanks in advance.

expected output https://i.stack.imgur.com/8CjbT.png

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Do not post pictures of data because then we have to retype everything just to try out the code. – MrFlick Apr 06 '22 at 06:17
  • Thanks for your suggestion @MrFlick, I've added a reproducible example now. – Sinkiewicz Apr 06 '22 at 07:03

1 Answer

The issue is that commas have two meanings in your data: sometimes they are punctuation inside an option's text, and sometimes they separate two ticked options. Splitting on commas therefore cannot recover the options reliably.
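To make the ambiguity concrete, here is a small base R sketch using the first response from your example data. It only illustrates the problem and is not part of the solution:

```r
# First response from the question's example data: three ticked options.
x <- "Quickly fall into sleep, but wake up shortly, Feel emotionally, physically weak, Sleep paralysis. i.e., wide awake but can't move your body"

# Splitting on every ", " also cuts at the commas inside each option's text.
pieces <- strsplit(x, ", ", fixed = TRUE)[[1]]
length(pieces)  # 6 pieces, although only 3 options were ticked
```

This is why matching each full option string works better here than splitting.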

You can use str_detect from stringr to identify the responses that contain each symptom, and mutate + ifelse to create the new columns. I used sub to isolate whatever comes to the right of "Other, ".

library(dplyr)
library(stringr)

# fixed() matches each option literally, so "." is not treated as a regex wildcard
df |> mutate(
        symptom1 = ifelse(str_detect(symptom, fixed("Quickly fall into sleep, but wake up shortly")), 1, 0),
        symptom2 = ifelse(str_detect(symptom, fixed("Feel emotionally, physically weak")), 1, 0),
        symptom3 = ifelse(str_detect(symptom, fixed("Sleep paralysis. i.e., wide awake but can't move your body")), 1, 0),
        symptom4 = ifelse(str_detect(symptom, fixed("Lose weight quickly, lack of appetite")), 1, 0),
        symptom5 = ifelse(str_detect(symptom, fixed("Other")), 1, 0),
        # keep only the free text that follows "Other, "; NA when "Other" was not ticked
        other_symptom = ifelse(symptom5 == 1, sub(".*Other, ", "", symptom), NA_character_))

I suspect there's a more succinct way to do this - happy to read other answers.
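One possibly shorter variant, as a sketch only: keep the four fixed options in a named vector and loop over it in base R. The names `choices` and `rest` are my own, and the example data is rebuilt from the option strings so the block runs on its own. It assumes the stored option text matches these strings exactly, and that the free text always follows "Other," (with or without a space).

```r
# The four fixed options, assumed to be stored verbatim in `symptom`.
choices <- c(
  symptom1 = "Quickly fall into sleep, but wake up shortly",
  symptom2 = "Feel emotionally, physically weak",
  symptom3 = "Sleep paralysis. i.e., wide awake but can't move your body",
  symptom4 = "Lose weight quickly, lack of appetite"
)

# Example data from the question, rebuilt from the option strings.
df <- data.frame(
  id = 1:4,
  symptom = c(
    paste(choices[1:3], collapse = ", "),
    paste(choices[c(2, 4)], collapse = ", "),
    paste(choices[3], "Other, increased dreaming", sep = ", "),
    unname(choices[3])
  )
)

# One 0/1 column per fixed option; fixed = TRUE matches the text literally.
for (nm in names(choices)) {
  df[[nm]] <- as.integer(grepl(choices[[nm]], df$symptom, fixed = TRUE))
}

# Strip the fixed options; whatever remains can only be the "Other" entry.
rest <- df$symptom
for (opt in choices) rest <- gsub(opt, "", rest, fixed = TRUE)

df$symptom5 <- as.integer(grepl("Other,", rest, fixed = TRUE))
df$other_symptom <- ifelse(df$symptom5 == 1,
                           trimws(sub(".*Other,", "", rest)),
                           NA_character_)
```

Adding or renaming an option then only means editing `choices`, at the cost of relying on exact string matches.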

Andrea M