
I would like to use str_extract_all to extract specific text strings from many columns of a spreadsheet containing error descriptions. A sample list:

fire_match <- c('fire', 'burned', 'burnt', 'burn', 'injured', 'injury', 'hurt', 'dangerous', 
  'accident', 'collided', 'collide', 'crashed', 'crash', 'smolder', 'flame', 'melting', 
  'melted', 'melt', 'danger')

My code technically does what it is supposed to do, but it also extracts (for example) 'fire' from 'misfire', which is incorrect. I am also having difficulty making the match case-insensitive.

This is a direct example of what is getting me 90% of the way there:

fire$Cause.Trigger <- str_extract_all(CAUSE_TEXT, paste(fire_match, collapse="|") )

My desired result is:

CAUSE_TEXT <- c("something caught fire", "something misfired", 
  "something caught Fire", "Injury occurred")
  • something caught fire -> fire
  • something misfired -> N/A
  • something caught Fire -> fire
  • Injury occurred -> injury
Dave Gruenewald
Gib999
  • Can you show a reproducible example of `CAUSE_TEXT`? – akrun May 07 '19 at 19:13
  • That's going to be tough because this stuff is confidential. In general terms I have several columns that have manually entered incident reports that I need to extract specific words out of. I can't do too much better than the left side of my desired result, but hopefully someone who has used stringr more than me sees what's probably my obvious error :) – Gib999 May 07 '19 at 19:17

1 Answer


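You can add \b around your individual terms so that each one only matches at a word boundary, and wrap the pattern in regex(..., ignore_case = TRUE) to make the match case-insensitive.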

pattern <- paste0("\\b", paste(fire_match , collapse="\\b|\\b"), "\\b")
str_extract_all(CAUSE_TEXT, regex(pattern, ignore_case = TRUE))
# [[1]]
# [1] "fire"
# [[2]]
# character(0)
# [[3]]
# [1] "Fire"
# [[4]]
# [1] "Injury"
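
The output above keeps the original capitalization ("Fire", "Injury") and returns character(0) where nothing matched, whereas the question asks for lowercase terms and N/A. A minimal sketch of one way to get there, assuming a comma-separated string per row is acceptable; the name `trigger` and the shortened `fire_match` list are only for illustration:

```r
library(stringr)

# Shortened term list and sample text taken from the question
fire_match <- c("fire", "burned", "burnt", "burn", "injured", "injury", "hurt")
CAUSE_TEXT <- c("something caught fire", "something misfired",
                "something caught Fire", "Injury occurred")

# Group the alternatives so one pair of \b anchors covers all of them
pattern <- paste0("\\b(?:", paste(fire_match, collapse = "|"), ")\\b")

matches <- str_extract_all(CAUSE_TEXT, regex(pattern, ignore_case = TRUE))

# Lowercase each hit, join multiple hits with ", ", and use NA when nothing matched
trigger <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else paste(tolower(m), collapse = ", "),
  character(1)
)
trigger
# expected values: "fire", NA, "fire", "injury"
```

Because the result is a plain character vector rather than a list, it can be assigned directly to a data frame column such as fire$Cause.Trigger.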
MrFlick