
So let's say I have a string

"Happy 2022 New 01 years!"

I'm looking to return the "01". To be more specific, I need the last set of digits in the string. This number could just be '1', or '10', or '999'... The rest of the string could be pretty much anything. I tried various regexes with gsub, but can't seem to get it quite right. There is something I have misunderstood.

E.g., if I do this:

gsub('.*(\\d+).*$', '\\1', x)

Then why do I get back "1"? Does the '+' in the regex not specify one or more digits?

How is my interpretation wrong? '.*' for any characters, '(\\d+)' for one or more digits, '.*' for some more characters, '$' at the end of the string. gsub is greedy, so it will return the last set of digits (therefore '01', not '2022'). '\\1' will replace the whole string with the first, and only, match. x is the string.

4 Answers

In your regex, the first .* matches every character (except newline characters), so it initially consumes the whole string. The engine then tries to match \d+, but there are no characters left to match. It therefore backtracks into .*, giving up one character at a time, until \d+ can match. That happens as soon as a single digit is freed (the 1 in your case): \d+ matches that one digit, and the rest of the string is matched by the trailing .*.
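
A quick way to see this in R is the sketch below, which uses the string from the question. The variable names are only for illustration. With a greedy leading `.*`, the capture group ends up with a single digit whether it is written as `\\d+` or `\\d`.

```r
x <- "Happy 2022 New 01 years!"

# The greedy .* only gives back as much as it has to,
# so the capture group is left with the final digit alone.
with_plus <- gsub(".*(\\d+).*$", "\\1", x)
without_plus <- gsub(".*(\\d).*$", "\\1", x)

print(with_plus)     # "1"
print(without_plus)  # "1"
```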

You can try this regex:

\d+(?![^\r\n\d]*\d)

Explanation:

  • \d+ - matches 1 or more digits, as many as possible
  • (?![^\r\n\d]*\d) - negative lookahead to make sure that there are no more digits later in the string
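
In R, one possible way to apply this pattern is with `regexpr()` and `regmatches()`, as in the sketch below. It assumes `perl = TRUE`, because R's default regex engine (TRE) does not support lookaheads, and it doubles the backslashes as R string literals require. The variable names are illustrative.

```r
x <- "Happy 2022 New 01 years!"

# Digits that are not followed (on the same line) by any further digit.
pattern <- "\\d+(?![^\\r\\n\\d]*\\d)"

# perl = TRUE switches to PCRE, which understands the negative lookahead.
last_number <- regmatches(x, regexpr(pattern, x, perl = TRUE))

print(last_number)  # "01"
```

Extracting the match directly avoids having to describe the rest of the string, which `gsub()` with a backreference would require.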
Gurmanjot Singh
  • Thank you for the solution and the good explanation of it! I think I understand now. I would have expected the \d+ to also match more than one digit during backtracking. So there really is no difference between \d and \d+ in this case? – Robbe Vankets Jan 04 '22 at 10:49
  • @RobbeVankets Yes. In your regex, if you had used `\d` instead of `\d+`, the result would have been the same. – Gurmanjot Singh Jan 04 '22 at 10:52
  • So, this regex does the trick in the demo. However, when I try to use it in R, I get "TRE pattern compilation error 'Invalid regexp'". Any ideas why? (Of course, I first added the extra escape characters that R requires.) – Robbe Vankets Jan 04 '22 at 11:28
  • @RobbeVankets I'm not an expert in `R`, but I think you need to set `perl=TRUE`, as mentioned [here](https://stackoverflow.com/a/43458623/5331061). [Here](https://ideone.com/DBffAO) is the working code with minor tweaks to the regex. – Gurmanjot Singh Jan 04 '22 at 12:08

Place word boundaries around the target final number:

x <- "Happy 2022 New 01 years!"
num <- gsub('.*\\b(\\d+)\\b.*$', '\\1', x)
num

[1] "01"

The challenge here is that we're tempted to use a lazy dot to stop at the first digit, e.g. .*?(\\d+).*. But that stops at the first number, whereas we want the last one. So the greedy dot is appropriate, and the word boundaries force the regex to capture the entire final number.
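
To make the contrast concrete, here is a small sketch comparing the two approaches on the string from the question. The lazy variant is run with `perl = TRUE` so that the lazy quantifier follows PCRE semantics. The variable names are illustrative.

```r
x <- "Happy 2022 New 01 years!"

# Lazy dot: stops as early as possible, so the FIRST number is captured.
first_number <- gsub(".*?(\\d+).*", "\\1", x, perl = TRUE)

# Greedy dot plus word boundaries: the whole LAST number is captured.
last_number <- gsub(".*\\b(\\d+)\\b.*$", "\\1", x)

print(first_number)  # "2022"
print(last_number)   # "01"
```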

Tim Biegeleisen
  • Aha, thank you for your suggestion. This would however only work if spaces surround the final number. I cannot guarantee that, so I'm looking for a more defensive alternative. – Robbe Vankets Jan 04 '22 at 10:36
  • @RobbeVankets Then use `gsub('.*(?<=\\D)(\\d+).*$', '\\1', x, perl=TRUE)` ... I have answered the question you actually did ask. – Tim Biegeleisen Jan 04 '22 at 10:39
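
The difference between the two patterns shows up when the digits are glued to letters. The sketch below uses a made-up input (`x2` is not from the question) in which no word boundary touches any number. Both calls use `perl = TRUE` so they run on the same engine.

```r
# Hypothetical input: digits directly adjacent to letters.
x2 <- "abc2022def01ghi"

# Word boundaries: there is no boundary next to the digits, so nothing
# matches and gsub() returns the input unchanged.
with_boundaries <- gsub(".*\\b(\\d+)\\b.*$", "\\1", x2, perl = TRUE)

# Lookbehind for a non-digit: only requires that the number starts
# right after a non-digit character.
with_lookbehind <- gsub(".*(?<=\\D)(\\d+).*$", "\\1", x2, perl = TRUE)

print(with_boundaries)  # "abc2022def01ghi"
print(with_lookbehind)  # "01"
```

Note that the lookbehind needs at least one non-digit before the number, so a string that consists of digits only would come back unchanged as well.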

This could work:

(\d+)[^\d]*$

https://regex101.com/r/DHrttA/1

In your solution, I presume the problem is that the first .* is greedy, so it consumes as much as it can and leaves only the final digit for the capture group.
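
One way this pattern might be used from R is with `regexec()`, which returns the capture groups alongside the full match. This is only a sketch. It assumes `perl = TRUE` (available for `regexec()` in R 3.3.0 and later) so that `\d` inside the bracket expression is handled by PCRE, and the variable names are illustrative.

```r
x <- "Happy 2022 New 01 years!"

# Digits followed only by non-digits up to the end of the string.
m <- regmatches(x, regexec("(\\d+)[^\\d]*$", x, perl = TRUE))

# m[[1]][1] is the full match, m[[1]][2] is the first capture group.
last_number <- m[[1]][2]

print(last_number)  # "01"
```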

KekuSemau

A workaround using strsplit

> tail(strsplit(x, "\\D+")[[1]], 1)
[1] "01"
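
A related idea, sketched here as a different technique from the `strsplit` call above, is to extract every run of digits with `gregexpr()` and keep the last one. The helper name `last_digits` is hypothetical, and returning `NA` when the string has no digits is an assumption about the desired behaviour.

```r
# Hypothetical helper: last run of digits in a single string, or NA if none.
last_digits <- function(s) {
  runs <- regmatches(s, gregexpr("\\d+", s))[[1]]  # all digit runs, in order
  if (length(runs) == 0) {
    return(NA_character_)
  }
  tail(runs, 1)
}

x <- "Happy 2022 New 01 years!"
print(last_digits(x))            # "01"
print(last_digits("no digits"))  # NA
```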
ThomasIsCoding