In R, I need to extract "Eight" from the following string:

this_str <- " Eight years blah blah 50 blah blah, two years blah blah blah."

Here is my attempt using gsub:

gsub("^.*\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str)

But this returns "two", which corresponds to the second occurrence of the pattern. Other posts say that sub() should return the first match, but when I use sub() it also gives "two".

ben

2 Answers

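The choice between sub and gsub is not the problem: sub does a single replacement while gsub does multiple ones, but your pattern is anchored with ^ and ends in .*, so it matches the whole string exactly once either way. The issue is that .* at the beginning is greedy: it consumes as much as it can, so it runs up to "two" (i.e., it swallows all but the last match). Instead we want it to be lazy, matching as little as possible, which is what .*? does: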

sub("^.*?\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str)
# [1] "Eight"
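
To make the difference concrete, here is a minimal sketch that runs the greedy and lazy patterns side by side on the string from the question. The variable names greedy and lazy are only illustrative; the patterns and the results are the ones already reported in the question and above.

```r
this_str <- " Eight years blah blah 50 blah blah, two years blah blah blah."

# Greedy: the leading .* takes as much as it can, so the capture group
# ends up on the LAST "<word> years|months" in the string.
greedy <- sub("^.*\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str)

# Lazy: .*? takes as little as it can, so the capture group
# ends up on the FIRST "<word> years|months" in the string.
lazy <- sub("^.*?\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str)

print(greedy)  # "two"
print(lazy)    # "Eight"
```

Note that if the pattern does not match at all, sub() returns the input string unchanged, so it may be worth checking with grepl() first when the input is not guaranteed to contain "years" or "months".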
Julius Vainora

Here, we could use an expression that allows optional whitespace around the word and an optional plural "s" on the unit, just in case, such as:

(\s+)?(.+?)(\s+)?(years?|months?).*

Our desired output is in the second capturing group:

(.+?)

and our code would look like:

gsub("(\\s+)?(.+?)(\\s+)?(years?|months?).*", "\\2", this_str)
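
As a rough sketch of how this pattern behaves on slightly different inputs, the snippet below wraps it in a small helper. The helper name first_duration_word and the second test string are hypothetical, added only for illustration. It also assumes perl = TRUE and uses sub() rather than gsub(), since only the first match is wanted; neither choice comes from the answer itself.

```r
# Hypothetical helper: returns the word that precedes the first
# "year(s)" or "month(s)" in x. Assumes PCRE semantics (perl = TRUE).
first_duration_word <- function(x) {
  sub("(\\s+)?(.+?)(\\s+)?(years?|months?).*", "\\2", x, perl = TRUE)
}

this_str <- " Eight years blah blah 50 blah blah, two years blah blah blah."
# Made-up input with a singular unit, to exercise the optional "s".
other_str <- " One month later, three years on."

print(first_duration_word(this_str))   # "Eight"
print(first_duration_word(other_str))  # "One"
```

Because the group (.+?) accepts any characters, it can capture more than one word when the unit is preceded by several words; the stricter [^ ]* from the question avoids that at the cost of requiring exact spacing.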

RegEx

If this expression isn't what you need, you can modify and test it at regex101.com.

RegEx Circuit

jex.im visualizes regular expressions.

Emma