0

So, I have some students who wrote a google form survey. The survey is in string form where they select a drop down menu that has options like:

I do not feel sad
I feel sad some of the times
I often feel sad
I feel sad all the time

This is a type of question modeled after the beck depression inventory. Each item is associated with a 0,1,2,3, respectively.

There are some 20 of these variables.

So, I have 2 data frames.

df1 has the survey data (strings of responses). Here are 2 of those variables

    head(df1[1:7,c('sad','optimism')])
    sad                                       optimism                                
     <chr>                                     <chr>                                   
    1 Throughout the day I sometimes feel sad   I am somewhat optimistic about my future
    2 Throughout the day I sometimes feel sad   I am somewhat optimistic about my future
    3 Throughout the day I sometimes feel happy I feel discouraged about the future     
    4 Throughout the day I sometimes feel happy I am optimistic about my future         
    5 Throughout the day I sometimes feel happy I am somewhat optimistic about my future
    6 Throughout the day I sometimes feel happy I am somewhat optimistic about my future
    7 Throughout the day I sometimes feel happy I feel discouraged about the future   

df2 has a key of conditions

    head(df2[1:4,c('sad','optimism')])
     sad                                   optimism                                              
      <chr>                                 <chr>                                                 
    1 Throughout the day I feel happy       I am optimistic about my future                       
    2 Throughout the day I sometimes feel … I am somewhat optimistic about my future              
    3 Throughout the day I sometimes feel … I feel discouraged about the future                   
    4 Throughout the day I feel sad         I feel the future is hopeless and that things cannot …

The variable names are the same in each dataframe.

I want to use dplyr's case_when using pipes to take each variable from df1 and compare it to the appropriate column in df2.

The following code actually works at converting the string to a number, but if you notice the case_when conditional checks the entire row of the dataframe, which is completely unnecessary. I want check simply the df1$sad variable from the survey with the df2$sad of the key.

    df1 %>%   mutate(across(x,~case_when(

    # The following lines of code checks a given record statement 
    # with ALL columns. Should only check indexed column
                                          . %in% df2[2,] ~ 0, #checks across all variables in df2; I just want to check a single column  
                                          . %in% df2[3,] ~ 1,
                                          . %in% df2[4,] ~ 2,
                                          . %in% df2[5,] ~ 3)))

So, some questions:

  1. I'm not sure case_when can do this
  2. If it does, I'm wondering if I need to use some dot notation
  3. or maybe there is a better solution

possible answers that I don't understand (yet)

[1.]: dplyr case_when This might be the best bet...Not sure how to wrap my head around it all.

[2.]: dplyr case_when Programmatically

  1. dplyr case_when multiple cases looks promising
General Grievance
  • 4,555
  • 31
  • 31
  • 45
Brian Holt
  • 75
  • 9

2 Answers2

1

I think this needs left_joins and not case_when().

Create tables:

library(tidyverse)

df1 <- tibble::tribble(
  ~x, ~sad, ~optimism,
  "1", "Throughout the day I sometimes feel sad", "I am somewhat optimistic about my future",
  "2", "Throughout the day I sometimes feel sad", "I am somewhat optimistic about my future",
  "3", "Throughout the day I sometimes feel happy", "I feel discouraged about the future",
  "4", "Throughout the day I sometimes feel happy", "I am optimistic about my future",
  "5", "Throughout the day I sometimes feel happy", "I am somewhat optimistic about my future",
  "6", "Throughout the day I sometimes feel happy", "I am somewhat optimistic about my future",
  "7", "Throughout the day I sometimes feel happy", "I feel discouraged about the future"
)

df2 <- tibble::tribble(
  ~y, ~sad, ~optimism,
   "1", "Throughout the day I feel happy", "I am optimistic about my future"
  ,"2", "Throughout the day I sometimes feel happy", "I am somewhat optimistic about my future"
  ,"3", "Throughout the day I sometimes feel sad", "I feel discouraged about the future"
  ,"4", "Throughout the day I feel sad", "I feel the future is hopeless and that things cannot"
)

Join onto the lookup table df2. Note it's possible to reduce the lookup table to just one column at a time (so first is sad and second is optimism):

df1 %>% 
  left_join(df2 %>% 
              select(y,
                     sad), by = "sad") %>% 
  left_join(df2 %>% 
              select(y,
                     optimism), by = "optimism") %>% 
# columns can be renamed within the select statement which is useful to reorder the coded columns next to the text
  select(x,
         sad,
         sad_coded = y.x,
         optimism,
         optimism_coded = y.y
         )

I hope this helps - please let me know if you were expecting different output.

# A tibble: 7 x 5
  x     sad                        sad_coded optimism                  optimism_coded
  <chr> <chr>                      <chr>     <chr>                     <chr>         
1 1     Throughout the day I some~ 3         I am somewhat optimistic~ 2             
2 2     Throughout the day I some~ 3         I am somewhat optimistic~ 2             
3 3     Throughout the day I some~ 2         I feel discouraged about~ 3             
4 4     Throughout the day I some~ 2         I am optimistic about my~ 1             
5 5     Throughout the day I some~ 2         I am somewhat optimistic~ 2             
6 6     Throughout the day I some~ 2         I am somewhat optimistic~ 2             
7 7     Throughout the day I some~ 2         I feel discouraged about~ 3  
Zoë Turner
  • 459
  • 5
  • 8
  • Thank you. I was getting to the conclusion that a left_join but I'm now struggling how to do this with many variables. So instead of just 'sad' and 'optimism' there are 20 other variables to do this. The piping works for a few variables, but I"m not seeing a way to scale it – Brian Holt Mar 03 '21 at 22:52
  • 1
    I'm thinking of changing the nature of the 2nd data frame. Instead of it being wide, I'd make it long/tall so that the first column would contain all of the possible survey statements and the 2nd column would repeat 0:3. And I think then the left_join would work for all the variables. – Brian Holt Mar 03 '21 at 23:03
  • 1
    Yes, definitely a better approach making the df2 longer. – Zoë Turner Mar 04 '21 at 14:34
1

Instead of using case_when and matching one-by-one each sentence in df2 you can use match which will give index of match. In base R, you can use Map.

cols <- names(df1)
df1[paste0(cols, '_num')] <- Map(match, df1[cols], df2[cols])

df1
# A tibble: 7 x 4
#   sad                                       optimism                                sad_num optimism_num
#  <chr>                                     <chr>                                     <int>        <int>
#1 Throughout the day I sometimes feel sad   I am somewhat optimistic about my futu…       3            2
#2 Throughout the day I sometimes feel sad   I am somewhat optimistic about my futu…       3            2
#3 Throughout the day I sometimes feel happy I feel discouraged about the future           2            3
#4 Throughout the day I sometimes feel happy I am optimistic about my future               2            1
#5 Throughout the day I sometimes feel happy I am somewhat optimistic about my futu…       2            2
#6 Throughout the day I sometimes feel happy I am somewhat optimistic about my futu…       2            2
#7 Throughout the day I sometimes feel happy I feel discouraged about the future           2            3

Or map2 in purrr if you want a tidyverse option -

df1[paste0(cols, '_num')] <- purrr::map2(df1[cols], df2[cols], match)

data

df1 <- structure(list(sad = c("Throughout the day I sometimes feel sad", 
"Throughout the day I sometimes feel sad", "Throughout the day I sometimes feel happy", 
"Throughout the day I sometimes feel happy", "Throughout the day I sometimes feel happy", 
"Throughout the day I sometimes feel happy", "Throughout the day I sometimes feel happy"
), optimism = c("I am somewhat optimistic about my future", "I am somewhat optimistic about my future", 
"I feel discouraged about the future", "I am optimistic about my future", 
"I am somewhat optimistic about my future", "I am somewhat optimistic about my future", 
"I feel discouraged about the future")), row.names = c(NA, -7L
), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(sad = c("Throughout the day I feel happy", "Throughout the day I sometimes feel happy", 
"Throughout the day I sometimes feel sad", "Throughout the day I feel sad"
), optimism = c("I am optimistic about my future", "I am somewhat optimistic about my future", 
"I feel discouraged about the future", "I feel the future is hopeless and that things cannot"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213