Tidyverse: Match word in string from list of keywords

Question

I'm trying to write some code that will check to see if a string contains any words contained in a list of terms, in order to create a new column in the dataframe.

This is the list of terms: vehicles <- c('vehicle', 'mazda', 'nissan', 'ford', 'honda', 'chevrolet', 'toyota')

Examples of the strings I'm searching include: "2001 honda civic", "2003 nissan altima", "2005 mazda 5", etc. (these are the asset_name in the code below).

My simplified code looks like this:

df %>%
  mutate(
    asset_type = case_when(
      vehicles %in% asset_name == TRUE ~ 'vehicle', # this doesn't work, obviously
      <CODE THAT DOES WORK HERE!!!>
      TRUE ~ asset_name
    )
  )

I've tried str_detect, str_extract, grepl & a custom function but can't seem to figure out how to make this work.

I know that for each asset_name entry, I need to loop through the list of vehicles to see if one of the vehicle models is in asset_name but I can't seem to make it work. grr...

Thanks in advance!!!

score 3 · Accepted Answer · answered Jan 06 '22 at 05:35


One approach might be to build a regex alternation of the vehicle terms, and then use grepl to match:

vehicles <- c('vehicle', 'mazda', 'nissan', 'ford', 'honda', 'chevrolet', 'toyota')
regex <- paste0("\\b(?:", paste(vehicles, collapse="|"), ")\\b")

df %>%
    mutate(
        asset_type = case_when(
            grepl(regex, asset_name) ~ 'vehicle',
            # <other case_when conditions go here>
            TRUE ~ asset_name
        )
    )
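For anyone who wants to run this end to end, here is a minimal sketch with made-up sample data (the `df` rows below are hypothetical, not from the question). It also passes `ignore.case = TRUE` and `perl = TRUE` to `grepl`; both are additions for illustration, on the assumption that real asset names may be capitalized inconsistently, and are not part of the answer above.

```r
library(dplyr)

vehicles <- c("vehicle", "mazda", "nissan", "ford", "honda", "chevrolet", "toyota")

# \\b anchors each keyword at word boundaries; (?:...) groups without capturing
regex <- paste0("\\b(?:", paste(vehicles, collapse = "|"), ")\\b")

# Hypothetical sample data (not from the question)
df <- data.frame(
  asset_name = c("2001 honda civic", "2003 Nissan Altima",
                 "affordable desk", "company vehicle"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(
    asset_type = case_when(
      # ignore.case and perl are illustrative additions, see the note above
      grepl(regex, asset_name, ignore.case = TRUE, perl = TRUE) ~ "vehicle",
      TRUE ~ asset_name
    )
  )

result
```

The "affordable desk" row keeps its own name: it contains the letters "ford", but not as a whole word, so the `\\b` boundaries reject it. Without the boundaries it would be labelled a vehicle.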


Tim Biegeleisen


Yes! That totally did the trick. regex to the rescue once again! thanks Tim! – Nathan Jan 06 '22 at 18:06

score 1 · Answer 2 · answered Jan 06 '22 at 10:19

String matching is one of the more complicated tasks around. str_detect() and the equivalent functions only match the exact character sequence: if we search for the keyword "mazda", we won't detect "madza" or "maazda". So if your data may contain misspellings, I think you need something like the mighty fuzzywuzzy to detect similar words (by edit distance). See https://cran.r-project.org/web/packages/fuzzywuzzyR/vignettes/functionality_of_fuzzywuzzyR_package.html . The functions are straightforward and easy to use, and might help with your problem.
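To give a rough idea of distance-based matching without installing anything, here is a sketch that uses base R's `adist()` (generalized Levenshtein distance) instead of fuzzywuzzyR. The helper name `has_fuzzy_keyword`, the `max_dist` threshold of 2, and the misspelled sample strings are all assumptions made for illustration.

```r
vehicles <- c("vehicle", "mazda", "nissan", "ford", "honda", "chevrolet", "toyota")

# Hypothetical sample data with deliberate misspellings
asset_name <- c("2001 honda civic", "2003 nisan altima", "2005 madza 5", "office desk")

# Hypothetical helper: TRUE if any word of a string is within
# `max_dist` edits of any keyword.
has_fuzzy_keyword <- function(x, keywords, max_dist = 2) {
  vapply(strsplit(x, "\\s+"), function(words) {
    # adist() returns a words-by-keywords matrix of edit distances
    any(adist(words, keywords) <= max_dist)
  }, logical(1))
}

is_vehicle <- has_fuzzy_keyword(asset_name, vehicles)
is_vehicle
```

A threshold this loose can produce false positives on short words ("fork" is one edit away from "ford"), so the cutoff would need tuning against real data.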

score 0 · Answer 3 · answered Jan 06 '22 at 06:37

Adapted from this answer:

library(tidyverse)

vehicles <- c('vehicle', 'mazda', 'nissan', 'ford', 'honda', 'chevrolet', 'toyota')
asset_name <- c("2001 honda civic", "2003 nissan altima", "2005 mazda 5", 
                "unmatched1", "unmatched2") # added unmatched strings
x <- seq_along(asset_name) # dummy variable to make df

df <- data.frame(x, asset_name)

df %>% 
  mutate(asset_type = case_when(
    asset_name %in% unlist(lapply(vehicles, grep, asset_name, value = TRUE)) ~ 'vehicle',
    TRUE ~ asset_name)
    )

Output:

  x         asset_name asset_type
1 1   2001 honda civic    vehicle
2 2 2003 nissan altima    vehicle
3 3       2005 mazda 5    vehicle
4 4         unmatched1 unmatched1
5 5         unmatched2 unmatched2
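To see more directly what is being matched, one alternative is to build a logical matrix with one column per keyword and then collapse it by row. This is a sketch in base R, not the code from this answer; `fixed = TRUE` is an added assumption so the keywords are treated as literal text rather than regular expressions.

```r
vehicles <- c("vehicle", "mazda", "nissan", "ford", "honda", "chevrolet", "toyota")
asset_name <- c("2001 honda civic", "2003 nissan altima", "2005 mazda 5",
                "unmatched1", "unmatched2")

# One column per keyword, one row per asset_name; TRUE where the keyword occurs
hits <- sapply(vehicles, grepl, x = asset_name, fixed = TRUE)

# A row counts as a vehicle if any keyword matched it
is_vehicle <- rowSums(hits) > 0

is_vehicle
```

Note that `sapply` only returns a matrix when `asset_name` has more than one element, so a single-row input would need separate handling. Like the `grep` version above, this matches substrings, not whole words.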

score 0 · Answer 4 · answered Jan 06 '22 at 08:09

Here is a demonstration of how to use grepl, str_detect, and str_extract within a dataset:

library(tidyverse)

# 1. create a mock vector
vehicles <- c("Honda", "Cadillac", "Mazda", "Hornet")

# 2. create a `|`-separated pattern from your vector, as Tim Biegeleisen already did
pattern_vehicles <- paste(vehicles, collapse = "|")

# Now apply the pattern with each function:

# grepl returns a logical vector TRUE/FALSE
grepl(pattern_vehicles, rownames(mtcars))
#[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#[17] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# grep returns the indices of the matching elements
grep(pattern_vehicles, rownames(mtcars))
#[1]  1  2  4  5 15 19

# to get the value names we use the argument value = TRUE of grep
grep(pattern_vehicles, rownames(mtcars), value = TRUE)
#[1] "Mazda RX4"          "Mazda RX4 Wag"      "Hornet 4 Drive"     "Hornet Sportabout" 
#[5] "Cadillac Fleetwood" "Honda Civic"    

# str_detect returns logical vector 
str_detect(rownames(mtcars), pattern_vehicles)
#[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#[18] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# str_extract extracts the matching values and leaves NA in the not matching
str_extract(rownames(mtcars), pattern_vehicles)
#[1] "Mazda"    "Mazda"    NA         "Hornet"   "Hornet"   NA         NA         NA         NA        
#[10] NA         NA         NA         NA         NA         "Cadillac" NA         NA         NA        
#[19] "Honda"    NA         NA         NA         NA         NA         NA         NA         NA        
#[28] NA         NA         NA         NA         NA


# To apply these vector operations to a dataframe column, you basically do the same.
# If you want the matched value itself in the new column, wrap the call in `ifelse`.
# Here is an example:

mtcars %>% 
  select(1) %>% 
  rownames_to_column("cars") %>% 
  mutate(new_grepl_TRUEFALSE = grepl(pattern_vehicles, cars)) %>% 
  mutate(new_grepl_value = ifelse(grepl(pattern_vehicles, cars), cars, NA_character_)) %>% 
  mutate(new_str_detect_TRUEFALSE = str_detect(cars, pattern_vehicles)) %>% 
  mutate(new_str_detect_value = ifelse(str_detect(cars, pattern_vehicles), cars, NA_character_)) %>% 
  mutate(new_str_extract = str_extract(cars, pattern_vehicles)) %>% 
  head()

               cars  mpg new_grepl_TRUEFALSE   new_grepl_value new_str_detect_TRUEFALSE new_str_detect_value new_str_extract
1         Mazda RX4 21.0                TRUE         Mazda RX4                     TRUE            Mazda RX4           Mazda
2     Mazda RX4 Wag 21.0                TRUE     Mazda RX4 Wag                     TRUE        Mazda RX4 Wag           Mazda
3        Datsun 710 22.8               FALSE              <NA>                    FALSE                 <NA>            <NA>
4    Hornet 4 Drive 21.4                TRUE    Hornet 4 Drive                     TRUE       Hornet 4 Drive          Hornet
5 Hornet Sportabout 18.7                TRUE Hornet Sportabout                     TRUE    Hornet Sportabout          Hornet
6           Valiant 18.1               FALSE              <NA>                    FALSE                 <NA>            <NA>
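Applied back to the original question, the same `str_extract` idea could look roughly like the sketch below. The "office desk" row and the variable names `matched` and `asset_type` are assumptions for illustration; the word boundaries are borrowed from the accepted answer.

```r
library(stringr)

vehicles <- c("vehicle", "mazda", "nissan", "ford", "honda", "chevrolet", "toyota")

# Sample strings from the question plus one hypothetical non-vehicle
asset_name <- c("2001 honda civic", "2003 nissan altima", "2005 mazda 5", "office desk")

# `|`-separated alternation, wrapped in word boundaries
pattern <- str_c("\\b(", str_c(vehicles, collapse = "|"), ")\\b")

# Which keyword matched (NA where none did)
matched <- str_extract(asset_name, pattern)

# Label matches as 'vehicle', otherwise keep the original name
asset_type <- ifelse(is.na(matched), asset_name, "vehicle")

data.frame(asset_name, matched, asset_type)
```

Keeping the `matched` column around makes it easy to check which keyword triggered each label.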
