I am trying to detect whether certain combinations of patterns are present or absent in one variable of a data frame.

There are some questions that are similar, but I could not find one that answers exactly what I am trying to achieve.

I am trying to:

  • check whether the patterns are present
  • define multiple patterns using logical operators (and, or, not = &, |, !)
  • ignore case
  • return the output as another column with TRUE/FALSE

I still cannot find a fix, but I will share what I have done so far to get your guidance:

Create a sample dataframe

x=structure(list(Sources = structure(c(1L, 7L, 6L, 8L, 9L, 4L,
3L, 5L, 2L), .Label = 
  c("Found in all nutritious foods in moderate amounts: pork, whole grain foods or enriched breads and cereals, legumes, nuts and seeds",
  
"Found only in fruits and vegetables, especially citrus fruits, vegetables in the cabbage family, cantaloupe, strawberries, peppers, tomatoes, potatoes, lettuce, papayas, mangoes, kiwifruit",
  
"Leafy green vegetables and legumes, seeds, orange juice, and liver; now added to most refined grains",
"Meat, fish, poultry, vegetables, fruits", 
  "Meat, poultry, fish, seafood, eggs, milk and milk products; not found in plant foods",
"Meat, poultry, fish, whole grain foods, enriched breads and cereals, vegetables (especially mushrooms, asparagus, and leafy green vegetables), peanut butter",
  
"Milk and milk products; leafy green vegetables; whole grain foods, enriched breads and cereals",
"Widespread in foods", "Widespread in foods; also produced in intestinal tract by bacteria"
), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))

This code detects the presence of either of the two specified strings; (?i) means ignore case.

library(stringr)

x$present = str_detect(x$Sources, "(?i)Vegetables|(?i)Meat")

# but it does not work with "and"
x$present = str_detect(x$Sources, "(?i)Vegetables&(?i)Meat")

# here it gives FALSE for all; my expected output is TRUE for the rows that contain both words
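A likely reason for the all-FALSE result: inside a regular expression, & has no logical meaning and is matched as a literal ampersand, so the pattern searches for the text "Vegetables&Meat". The following is a minimal sketch of two ways to express "and", assuming stringr is installed. The vector s is made up for illustration and is not the data frame above.

```r
library(stringr)

# Made-up example strings, not the data frame from the question
s <- c("Meat, fish, poultry, vegetables, fruits",
       "Widespread in foods",
       "Leafy green vegetables and legumes")

# Option 1: two case-insensitive detections combined with R's vectorised &
both <- str_detect(s, regex("vegetables", ignore_case = TRUE)) &
  str_detect(s, regex("meat", ignore_case = TRUE))

# Option 2: a single pattern with two lookaheads; both must succeed
both_lookahead <- str_detect(s, "(?i)^(?=.*vegetables)(?=.*meat)")

both            # TRUE FALSE FALSE
both_lookahead  # TRUE FALSE FALSE
```

Option 1 is usually easier to read and extends naturally to "or" (|) and "not" (!). Option 2 keeps everything in one pattern at the cost of a denser regex.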

This one works by filtering for the desired combination:

  • it works with |, & and !
  • but it only filters the rows of interest. Is there a way to add another column to the dataset that is TRUE if the pattern is present?

library(dplyr)

x %>% filter(str_detect(Sources, "(?i)Vegetables") & str_detect(Sources, "(?i)Meat"))

x %>% filter(str_detect(Sources, "(?i)Vegetables") & !str_detect(Sources, "(?i)Meat")) # does not contain meat

x %>% filter(!str_detect(Sources, "(?i)Meat") & str_detect(Sources, "(?i)Vegetables") & str_detect(Sources, "(?i)Grain"))

Finally, I found this package, which looks like it can do the job, but it only works with vectors. Is there a way to make it work for variables in a data frame, for example using lapply to return another variable with TRUE/FALSE?

library(sjmisc)

str_contains(x$Sources, "Meat", ignore.case = TRUE)
Bahi8482

2 Answers

Use mutate with str_detect to create the new column:

library(tidyverse)

x %>% 
  mutate(pattern_detected = 
           str_detect(Sources, "(?i)Vegetables") & 
           str_detect(Sources, "(?i)Meat"))
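The same idea extends to "or" and "not", since each str_detect call returns a logical vector that R's own operators can combine. Below is a minimal sketch, assuming dplyr and stringr are installed. The data frame foods is a made-up stand-in for the question's data, and has() is a hypothetical helper defined here, not a stringr function.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for the question's data frame
foods <- data.frame(
  Sources = c("Meat, fish, poultry, vegetables, fruits",
              "Leafy green vegetables and legumes",
              "Widespread in foods"),
  stringsAsFactors = FALSE
)

# has() is a hypothetical helper: case-insensitive detection of one word
has <- function(text, word) str_detect(text, regex(word, ignore_case = TRUE))

flagged <- foods %>%
  mutate(
    veg_and_meat = has(Sources, "vegetables") & has(Sources, "meat"),
    veg_not_meat = has(Sources, "vegetables") & !has(Sources, "meat"),
    veg_or_meat  = has(Sources, "vegetables") | has(Sources, "meat")
  )

flagged
```

Each new column is a plain TRUE/FALSE vector, so it can be used later in filter() or summarised with sum().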
andrew_reece

You can apply the function from the sjmisc package over a data.frame. The workhorse here is two nested sapply calls: the outer one iterates over the columns of the data.frame, and the inner one over the rows.

library(sjmisc)
# build dummy data.frame
df <- data.frame(x, x, x)

sapply(df, function(x) sapply(x, 
                             str_contains, 
                             pattern = c("Meat", "Vegetables"), 
                             logic = "and", ignore.case = TRUE))
         Sources Sources.1 Sources.2
 [1,]   FALSE     FALSE     FALSE
 [2,]   FALSE     FALSE     FALSE
 [3,]    TRUE      TRUE      TRUE
 [4,]   FALSE     FALSE     FALSE
 [5,]   FALSE     FALSE     FALSE
 [6,]    TRUE      TRUE      TRUE
 [7,]   FALSE     FALSE     FALSE
 [8,]   FALSE     FALSE     FALSE
 [9,]   FALSE     FALSE     FALSE

The output is a matrix. If you want a data.frame, wrap it in as.data.frame.

as.data.frame(sapply(df, function(x) sapply(x, 
                                            str_contains, 
                                            pattern = c("Meat", "Vegetables"), 
                                            logic = "and", ignore.case = TRUE)))

  Sources Sources.1 Sources.2
1   FALSE     FALSE     FALSE
2   FALSE     FALSE     FALSE
3    TRUE      TRUE      TRUE
4   FALSE     FALSE     FALSE
5   FALSE     FALSE     FALSE
6    TRUE      TRUE      TRUE
7   FALSE     FALSE     FALSE
8   FALSE     FALSE     FALSE
9   FALSE     FALSE     FALSE
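If only one new column is needed and an extra package is not wanted, the same "and" logic can be sketched in base R with grepl, which is already vectorised over rows. This is an alternative to the sjmisc approach above, not a description of how str_contains works internally. The helper all_present() and the data frame foods are hypothetical names made up for this sketch.

```r
# all_present() is a hypothetical helper written for this sketch
all_present <- function(text, patterns) {
  # one logical vector per pattern (case-insensitive) ...
  hits <- lapply(patterns, grepl, x = as.character(text), ignore.case = TRUE)
  # ... combined element-wise so a row is TRUE only if every pattern matched
  Reduce(`&`, hits)
}

# Made-up stand-in for the question's data frame
foods <- data.frame(
  Sources = c("Meat, fish, poultry, vegetables, fruits",
              "Leafy green vegetables and legumes",
              "Widespread in foods")
)

foods$meat_and_veg <- all_present(foods$Sources, c("meat", "vegetables"))
foods$meat_and_veg  # TRUE FALSE FALSE
```

The as.character() call means the helper also accepts factor columns such as the one in the question. Swapping `&` for `|` in Reduce would give "any pattern present" instead.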
Ben Norris
  • Thanks for your help. I also found this in the package repository, which can also do the job: https://github.com/strengejacke/sjmisc/issues/141 (just sharing, as it may be useful for someone). – Bahi8482 Oct 31 '20 at 01:44