I'm trying to use str_detect and case_when to recode strings based on multiple patterns, and paste each occurrence of the recoded value(s) into a new column. The Correct column is the output I'm trying to achieve.

This is similar to this question and this question. If it can't be done with case_when (limited to one pattern, I think), is there a better way to achieve this while still using the tidyverse?

library(dplyr)
library(stringr)

Fruit=c("Apples","apples, maybe bananas","Oranges","grapes w apples","pears")
Num=c(1,2,3,4,5)
data=data.frame(Num,Fruit)

df= data %>% mutate(Incorrect=
paste(case_when(
  str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
  str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
  str_detect(Fruit, regex("grapes | oranges", ignore_case=TRUE)) ~ "ok",
  str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
  TRUE ~ "other"
),sep=","))

  Num                 Fruit Incorrect
  1                Apples      good
  2 apples, maybe bananas      good
  3               Oranges      other
  4       grapes w apples      good
  5                pears       other

 Num                 Fruit    Correct
   1                Apples       good
   2 apples, maybe bananas good,gross
   3               Oranges         ok
   4       grapes w apples    ok,good
   5                pears       other
W148SMH
  • Related https://stackoverflow.com/questions/53851627/how-to-detect-more-than-one-regex-in-a-case-when-statement & https://stackoverflow.com/questions/56588108/case-when-with-partial-string-match-and-contains – Tung Nov 13 '20 at 00:18

1 Answer

case_when evaluates its conditions in order and, for each row, returns the value of the first condition that is satisfied; it does not check the remaining conditions. So in such cases it is usually better to put every entry in a separate row, which makes it easier to assign a value, and then summarise the rows back together. However, in this case the Fruit column does not have a clear separator: some fruits are separated by a comma (,), some by whitespace, and there are additional words between them. To handle all such cases, we assign NA to the words that do not match and then remove them when summarising.
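
To illustrate the first-match behaviour described above, here is a minimal sketch on a two-element vector (the names `x`, `first_match`, `with_spaces` and `without_spaces` are made up for this illustration). It also shows why row 3 came out as "other" in the question: spaces inside a regular expression are literal, so "grapes | oranges" only matches "grapes " followed by a space or " oranges" preceded by one.

```r
library(dplyr)
library(stringr)

x <- c("apples, maybe bananas", "Oranges")

# case_when() returns the value of the first condition that is TRUE for each
# element, so the "bananas" condition is never reached for the first string.
first_match <- case_when(
  str_detect(x, regex("apples", ignore_case = TRUE))  ~ "good",
  str_detect(x, regex("bananas", ignore_case = TRUE)) ~ "gross",
  TRUE ~ "other"
)
first_match  # "good" "other"

# The spaces around "|" are part of the pattern, so "Oranges" does not match.
with_spaces    <- str_detect("Oranges", regex("grapes | oranges", ignore_case = TRUE))
without_spaces <- str_detect("Oranges", regex("grapes|oranges", ignore_case = TRUE))
with_spaces     # FALSE
without_spaces  # TRUE
```

The splitting approach described above then looks like this: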

library(dplyr)
library(stringr)

data %>%
  # split each entry into one row per word (on commas and whitespace)
  tidyr::separate_rows(Fruit, sep = ",|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case=TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
    TRUE ~ NA_character_)) %>%   # non-matching words become NA
  group_by(Num) %>%
  # drop the NAs and collapse the remaining labels for each Num
  summarise(Correct = toString(na.omit(Correct))) %>%
  # bring back the original Fruit column
  left_join(data, by = "Num")

#   Num Correct     Fruit                
#  <dbl> <chr>       <fct>                
#1     1 good        Apples               
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges              
#4     4 ok, good    grapes w apples      
#5     5 sour        Lemons               

The output above is for the original data, where row 5 was Lemons. For the updated data, we can first remove the extra words that occur and then do the same:

data %>%
  # remove the filler words; \\b keeps "w" from being stripped out of other words
  mutate(Fruit = gsub("\\b(maybe|w)\\b", "", Fruit)) %>%
  # split on a comma followed by whitespace, or on whitespace alone
  tidyr::separate_rows(Fruit, sep = ",\\s+|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case=TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
    TRUE ~ "other")) %>%
  group_by(Num) %>%
  summarise(Correct = toString(na.omit(Correct))) %>%
  # bring back the original Fruit column
  left_join(data, by = "Num")

#    Num Correct     Fruit                
#  <dbl> <chr>       <fct>                
#1     1 good        Apples               
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges              
#4     4 ok, good    grapes w apples      
#5     5 other       pears                
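
As an alternative that avoids splitting the strings and removing filler words, one could test every pattern against the whole entry and join the labels of all patterns that match. The following is only a sketch under stated assumptions, not the method used above: the `labels` lookup vector and the `recode_all` helper are hypothetical names introduced for illustration, and the data is rebuilt with `stringsAsFactors = FALSE` so that Fruit is a character column.

```r
library(dplyr)
library(stringr)

data <- data.frame(
  Num = c(1, 2, 3, 4, 5),
  Fruit = c("Apples", "apples, maybe bananas", "Oranges", "grapes w apples", "pears"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: names are regex patterns, values are the labels.
labels <- c("apples" = "good", "bananas" = "gross",
            "grapes|oranges" = "ok", "lemon" = "sour")

# Hypothetical helper: label one string with every pattern that matches it.
recode_all <- function(x) {
  # start position of the first match of each pattern (NA when there is none)
  pos <- str_locate(x, regex(names(labels), ignore_case = TRUE))[, "start"]
  # keep only the matched patterns, ordered by where they occur in the string
  hits <- labels[order(pos, na.last = NA)]
  if (length(hits) == 0) "other" else paste(hits, collapse = ",")
}

result <- data %>%
  mutate(Correct = vapply(Fruit, recode_all, character(1), USE.NAMES = FALSE))

result$Correct
# "good" "good,gross" "ok" "ok,good" "other"
```

Note that this sketch reports each pattern at most once per entry, and "other" appears only when no pattern matches at all, so filler words such as maybe and w need no special handling.
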
Ronak Shah
  • The only issue is `TRUE ~ NA_character_`. I want meaningful non-matching strings to be coded as `TRUE ~ "other"`. I edited the data to better reflect my actual data. @RonakShah – W148SMH Nov 30 '19 at 01:05
  • @W148SMH As mentioned in my post, the problem arises because there is no clear separator between the fruits. Sometimes they are separated by a comma, sometimes by a space. So I have split on both, but there are some non-matching words such as `maybe` and `w`. If we give `TRUE ~ 'other'` then those words would also be assigned `'other'`. – Ronak Shah Nov 30 '19 at 01:15
  • If I remove `maybe` and `w` in the beginning with something like `str_replace(Fruit,"maybe|w",""))` it still wants to add `other` after those words are removed @RonakShah – W148SMH Nov 30 '19 at 01:54
  • @W148SMH yes, if those are the only words occurring then you can remove them. See updated answer. – Ronak Shah Nov 30 '19 at 02:59