I have a text document and I'm trying to get the text between the words "abstract" and "keywords" (in R). This is the code I'm using:

gsub(".*abstract\\s*|keywords.*", "\\1", string)

However, this didn't work because the word "abstract" occurs somewhere else in the text, so I made it non-greedy like this (added ? after the leading .*):

gsub(".*?abstract\\s*|keywords.*", "\\1", string)

But for some reason it now takes the text between "abstract" and "keywords" (which is what I want), but ALSO the text starting from the second "abstract" appearing in the text, all the way to the end. Any ideas?

  • Possible duplicate of [Extract info inside all parenthesis in R](http://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r) – Barker Jan 20 '17 at 00:31

2 Answers

I think this should give you what you are looking for:

regmatches(string, gregexpr("(?<=abstract).*(?=keywords)", string, perl = TRUE))

What it does:

  • (?<=abstract) is a "look behind": the match must start right after the word "abstract"
  • .* matches any number of characters
  • (?=keywords) is a "look ahead": the match must end right before the word "keywords"
  • gregexpr looks for the given regular expression in string
  • perl = TRUE enables the "look ahead" and "look behind" functionality
  • regmatches pulls out the matching piece of the string using the regular expression.
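As a rough illustration of the lookaround approach, here is a minimal sketch on a made-up sample string (`txt` is an assumption; the asker's real document is not shown). It swaps in a lazy `.*?` so the match stops at the first "keywords", uses `regexpr` to take only the first match, and trims the surrounding spaces with `trimws`.

```r
# Hypothetical sample text, not the asker's data.
txt <- "title abstract words that might be between keywords more text abstract ideas"

# The lookbehind requires the match to start right after "abstract";
# the lazy .*? stops at the first position where the lookahead sees "keywords".
m <- regmatches(txt, regexpr("(?<=abstract).*?(?=keywords)", txt, perl = TRUE))

# Drop the spaces left next to the two boundary words.
result <- trimws(m)
print(result)
```

With perl = TRUE, `.` does not match a newline, so if the abstract spans several lines the pattern would probably need the `(?s)` flag in front.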
Barker

It doesn't look like you are capturing anything in your pattern. You need a pair of parentheses () around the part you want to keep, so that \\1 returns your target:

words <- c("these are some different abstract words that might be between keywords or they might just be bounded by abstract ideas")
gsub(".* abstract (.*) keywords.*", "\\1", words)
[1] "words that might be between"
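To show how greediness decides which "abstract" the capture group ends up next to, here is a small sketch on a hypothetical string (`words2` is made up for illustration). Both calls use perl = TRUE so that the lazy `.*?` follows the usual backtracking rules; R's default engine may behave differently when greedy and lazy quantifiers are mixed.

```r
# Hypothetical text with two "abstract ... keywords" pairs.
words2 <- "intro abstract first part keywords list then abstract second part keywords again"

# Greedy leading .* backtracks from the end, so it lands on the LAST " abstract ".
greedy <- sub(".* abstract (.*) keywords.*", "\\1", words2, perl = TRUE)

# Lazy .*? stops at the FIRST "abstract", and the lazy group stops at the
# first "keywords" after it; \\s* keeps the boundary spaces out of the group.
lazy <- sub("^.*?abstract\\s*(.*?)\\s*keywords.*$", "\\1", words2, perl = TRUE)

print(greedy)
print(lazy)
```

The lazy form is the one to reach for when the first occurrence of each boundary word is the one that matters.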
Nate
  • Hey, thanks for the quick answer! To be honest, I'm not good at regex, I just found the command by searching it on Google. One more question though, I used the exact same command to get text between "abstract" and "introduction" and for some reason that one does work. Do you know why? Here is the code: gsub(".*abstract\\s*|introduction.*", "", words) –  Jan 20 '17 at 00:26
  • First, here is my favorite cheat sheet: https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/ (I use that thing A LOT, because regex is hard) – Nate Jan 20 '17 at 00:32
  • Yes I know, but my question was why that piece of code I posted works (believe it or not, but it does actually work for the introduction part and not for the keywords part, which is why I'm confused). Thanks again! –  Jan 20 '17 at 00:40
  • ooooh i see, i can't tell you for certain without seeing the actual text you are working with, likely what happened is by subbing both of your search terms with empty strings (`\\1` was really `""`, since nothing was being caught), all you were left with was your target, but like you saw later that strategy isn't always going to give you words between your "boundaries" – Nate Jan 20 '17 at 00:43
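The point in the last comment can be sketched as follows, on two made-up strings (`s1` and `s2` are assumptions, since the actual text was never posted). The pattern deletes everything up to "abstract" and everything from the second boundary word onward, so it only leaves the target when "abstract" does not appear again later. The sketch uses perl = TRUE; the default engine may differ in detail.

```r
# Hypothetical text where "abstract" appears only once: deletion works.
s1 <- "title abstract the summary introduction body text"
ok <- trimws(gsub(".*abstract\\s*|introduction.*", "", s1, perl = TRUE))

# Hypothetical text with a second "abstract" after "keywords": the greedy .*
# runs through the LAST "abstract", so the real target is deleted too.
s2 <- "title abstract the summary keywords k1 k2 abstract ideas end"
bad <- gsub(".*abstract\\s*|keywords.*", "", s2, perl = TRUE)

print(ok)
print(bad)
```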