
I've seen a couple of similar questions posted on SO regarding this topic, but they seem to be worded improperly (example) or in a different language (example).

In my scenario, I consider everything that is surrounded by white space to be a word. Emoticons, numbers, strings of letters that aren't really words, I don't care. I just want to get some context around the string that was found without having to read the entire file to figure out if it's a valid match.

I tried using the following, but it takes a while to run if you've got a long text file:

text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."

stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")

I'm assuming there is a much, much faster/more efficient way to do this, yes?

tblznbits
  • do you only care about the first string match? I would think you want more than that. – fishtank Dec 21 '15 at 20:44
  • @fishtank I'd want more than the first, which is why I tweaked the answer below to use `stringr::str_extract_all` as opposed to `stringr::str_extract` – tblznbits Dec 21 '15 at 20:47

3 Answers


Try this:

stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternatively, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")

#[1] "and created Baron Verulam in 1618[4] and"

Change the number inside the {} to suit your needs.
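
For example (an adaptation based on the asker's follow-up in the comments, not part of the original answer), a range quantifier plus `str_extract_all` grabs up to five words on each side and returns every occurrence rather than only the first:

stringr::str_extract_all(text, "([^\\s]+\\s){0,5}Verulam(\\s[^\\s]+){0,5}")

# [[1]]
# [1] "in 1603 and created Baron Verulam in 1618[4] and Viscount St."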

You can use non-capture (?:) groups, too, though I'm not sure yet whether that will improve speed.

stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
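
If you want to check whether dropping the capture groups actually helps, a quick benchmark is easy to run (a sketch assuming the microbenchmark package, not part of the original answer):

library(microbenchmark)

microbenchmark(
  original   = stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}"),
  capturing  = stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}"),
  noncapture = stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}"),
  times = 100
)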
Jota
  • I really like the one-line approach. It's clean and the regex isn't hard to understand. I made a slight modification in my use case to allow for matches that might be inside of parentheses or at the end of a sentence, in addition to allowing for a varying number of words before and after for a scenario where the word is towards the end of the text. It also matches all instances of the word instead of the first. `stringr::str_extract_all(text, "([^\\s]+\\s){1,5}Verulam(\\s[^\\s]+){1,5}")` – tblznbits Dec 21 '15 at 20:43
  • That should read `stringr::str_extract_all(text, "([^\\s]+\\s){1,5}Verulam.?(\\s[^\\s]+){1,5}")` instead of what it says. I just realized it and can't edit the comment now. The added `.?` allows a period, comma, or parenthesis after the word. – tblznbits Dec 21 '15 at 20:55
  • @brittenb I think you want `{0,5}` instead of `{1,5}` if you want words that are beginning or end of the text. – fishtank Dec 21 '15 at 22:55

I'd use unlist(strsplit) and then index the resulting vector. You could make it a function so that the number of words to fetch pre and post is a flexible parameter:

getContext <- function(text, look_for, pre = 3, post=pre) {
  # create vector of words (anything separated by a space)
  t_vec <- unlist(strsplit(text, '\\s'))

  # find position of matches
  matches <- which(t_vec==look_for)

  # return words before & after if any matches
  if(length(matches) > 0) {
    out <- 
      list(before = sapply(matches, function(m) 
                      if (m - pre < 1) NA else t_vec[(m - pre):(m - 1)]), 
           after  = sapply(matches, function(m) 
                      if (m + post > length(t_vec)) NA else t_vec[(m + 1):(m + post)]))

    return(out)
  } else {
    warning('No matches')
  }
}

Works for a single match

getContext(text, 'Verulam')

# $before
#      [,1]     
# [1,] "and"    
# [2,] "created"
# [3,] "Baron"  
# 
# $after
#      [,1]     
# [1,] "in"     
# [2,] "1618[4]"
# [3,] "and"   

Also works if there's more than one match

getContext(text, 'he')

# $before
#      [,1]     [,2]           [,3]          [,4]     
# [1,] "After"  "nature."      "in"          "John"   
# [2,] "his"    "Most"         "1621;[3][b]" "Aubrey" 
# [3,] "death," "importantly," "as"          "stating"
# 
# $after
#      [,1]          [,2]     [,3]      [,4]        
# [1,] "remained"    "argued" "died"    "contracted"
# [2,] "extremely"   "this"   "without" "the"       
# [3,] "influential" "could"  "heirs,"  "condition" 

getContext(text, 'fruitloops')
# Warning message:
#   In getContext(text, "fruitloops") : No matches
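
To illustrate the edge case fishtank raises in the comments below (an extra example, not from the original answer): with the boundary check, a match on the very first word returns NA for the missing 'before' words instead of failing

getContext(text, 'He')

# $before
# [1] NA
# 
# $after
#      [,1]    
# [1,] "served"
# [2,] "both"  
# [3,] "as"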
arvi1000
  • nice solution but need to handle the negative indexing else `getContext(text, "He")` won't work. – fishtank Dec 21 '15 at 20:37
  • Yeah, I really like this solution as well, but the one-liner below, with some edits, better suits this situation. – tblznbits Dec 21 '15 at 20:40
  • @fishtank - good point, edited. Also thought of using `pmin(0, m - pre)`, but this way the "out of bounds" result will be the same for both 'before' and 'after' items (i.e. both NA) – arvi1000 Dec 22 '15 at 02:29

If you don't mind triplicating the data, you can build a data.frame, which is usually the most convenient structure to work with in R.

context <- function(text){
  # split on literal spaces (fixed = TRUE avoids regex overhead)
  splittedText <- strsplit(text, ' ', fixed = TRUE)[[1]]

  # shift the word vector one position each way, padding the ends with ''
  data.frame(
    words  = splittedText,
    before = head(c('', splittedText), -1), 
    after  = tail(c(splittedText, ''), -1)
  )
}

Much cleaner IMO:

info <- context(text)

print(subset(info, words == 'Verulam'))

print(subset(info, before == 'Lord'))

print(subset(info, grepl('[[:digit:]]', words)))

#       words before after
# 161 Verulam  Baron    in
#        words before after
# 9 Chancellor   Lord    of
#             words before after
# 43  empiricism.[6]     of   His
# 157           1603     in   and
# 163        1618[4]     in   and
# 169    1621;[3][b]     in    as
# 187          1626,     in  with
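
If you want more than one word of context per side, the same shift-and-pad idea extends to a wider window. A minimal sketch (not part of the original answer; the function name contextN and the before*/after* column names are made up for illustration):

contextN <- function(text, n = 3) {
  words <- strsplit(text, ' ', fixed = TRUE)[[1]]

  # shift the word vector by k positions, padding the ends with ''
  shift <- function(k) {
    if (k > 0) {
      c(tail(words, -k), rep('', k))   # words k positions ahead
    } else {
      c(rep('', -k), head(words, k))   # words |k| positions behind
    }
  }

  cols <- c(
    setNames(lapply(n:1, function(k) shift(-k)), paste0('before', n:1)),
    list(words = words),
    setNames(lapply(seq_len(n), shift), paste0('after', seq_len(n)))
  )
  data.frame(cols, stringsAsFactors = FALSE)
}

subset(contextN(text, 3), words == 'Verulam')

#     before3 before2 before1   words after1  after2 after3
# 161     and created   Baron Verulam     in 1618[4]    and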
durum