str_extract_all - find only exact strings from a list

Question

I would like to use str_extract_all to extract specific text strings from many columns of a spreadsheet containing error descriptions. A sample list:

fire_match <- c('fire', 'burned', 'burnt', 'burn', 'injured', 'injury', 'hurt', 'dangerous', 
  'accident', 'collided', 'collide', 'crashed', 'crash', 'smolder', 'flame', 'melting', 
  'melted', 'melt', 'danger')

My code technically works, but it also extracts (for example) 'fire' from 'misfire', which is incorrect. I am also having trouble making the match case-insensitive.

This is a direct example of what is getting me 90% of the way there:

fire$Cause.Trigger <- str_extract_all(CAUSE_TEXT, paste(fire_match, collapse="|") )

My desired result is:

CAUSE_TEXT <- c("something caught fire", "something misfired", 
  "something caught Fire", "Injury occurred")

something caught fire -> fire
something misfired -> N/A
something caught Fire -> fire
Injury occurred -> injury

That's going to be tough because this stuff is confidential. In general terms I have several columns that have manually entered incident reports that I need to extract specific words out of. I can't do too much better than the left side of my desired result, but hopefully someone who has used stringr more than me sees what's probably my obvious error :) — Gib999, May 07 '19 at 19:17

score 3 · Accepted Answer · answered May 07 '19 at 19:20

You can add \b around each individual term so that it only matches at a word boundary. Wrapping the pattern in regex(..., ignore_case = TRUE) takes care of the case sensitivity.

library(stringr)

# Surround every term with \b: "\\bfire\\b|\\bburned\\b|..."
pattern <- paste0("\\b", paste(fire_match, collapse = "\\b|\\b"), "\\b")
str_extract_all(CAUSE_TEXT, regex(pattern, ignore_case = TRUE))
# [[1]]
# [1] "fire"
# [[2]]
# character(0)
# [[3]]
# [1] "Fire"
# [[4]]
# [1] "Injury"
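
Note that the matches keep their original case ("Fire", "Injury"), while the desired result in the question is lowercase with N/A for no match. A minimal sketch of one way to get there, assuming stringr is installed and that multiple hits in one entry should be joined with ", " (that separator and the name cause_trigger are my own choices, not from the question). It also wraps the terms in one non-capturing group so \b only has to be written once; the term list is shortened for illustration.

```r
library(stringr)

fire_match <- c("fire", "burned", "injury")  # shortened list, for illustration only
CAUSE_TEXT <- c("something caught fire", "something misfired",
                "something caught Fire", "Injury occurred")

# One non-capturing group inside a single pair of word boundaries:
# "\\b(?:fire|burned|injury)\\b"
pattern <- paste0("\\b(?:", paste(fire_match, collapse = "|"), ")\\b")

matches <- str_extract_all(CAUSE_TEXT, regex(pattern, ignore_case = TRUE))

# Lowercase each match, join multiple hits with ", ", and use NA when nothing matched
cause_trigger <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else paste(tolower(m), collapse = ", "),
  character(1)
)
cause_trigger
# [1] "fire"   NA       "fire"   "injury"
```

Because the result is a plain character vector with one element per input, it can be assigned straight to a data frame column such as fire$Cause.Trigger, whereas str_extract_all on its own returns a list.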
