R remove repeated digit sequences

Question

I am trying to remove all digits in a string except the first set of digits. In other words, I want to drop every repeated set of digits (there could be one set or more than ten in the string) and keep only the first set along with the rest of the string.

For example, the following string:

x <- 'foo123bar123baz123456abc1111def123456789'

The result would be:

foo123barbazabcdef

I have tried using gsub to replace \d+ with an empty string, but this removes all digits in the string. I have also tried using groups to capture some of the results, but had no luck.

hwnd · Answer 1 · 2014-11-30T17:53:29.693

Using gsub with perl=TRUE you can use the \G feature, an anchor that can match at one of two positions: the start of the string or the position where the previous match ended.

x <- 'foo123bar123baz123456abc1111def123456789'
gsub('(?:\\d+|\\G(?<!^)\\D*)\\K\\d*', '', x, perl=TRUE)
# [1] "foo123barbazabcdef"

Explanation:

(?:           # group, but do not capture:
  \d+         #   digits (0-9) (1 or more times)
 |            # OR
  \G(?<!^)    #   contiguous with the previous match, not at the start of the string
  \D*         #   non-digits (all but 0-9) (0 or more times)
)\K           # end of grouping; \K resets the start of the reported match
\d*           # digits (0-9) (0 or more times)
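To see what \K does in isolation, here is a minimal sketch on a made-up string (the input 'abc123def456' is an assumption for illustration, not taken from the question). Everything matched before \K is still required for the match, but it is left out of the part that gets replaced:

```r
# Hypothetical input, used only to illustrate \K.
s <- 'abc123def456'

# \d+ must match first, but \K drops it from the reported match,
# so only the following run of non-digits (\D+) is replaced.
res <- sub('\\d+\\K\\D+', '', s, perl = TRUE)
print(res)
```

Here the first digit run acts as required context and only the text after \K is removed.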

Alternatively, you can use an optional group:

gsub('(?:^\\D*\\d+)?\\K\\d*', '', x, perl=TRUE)

Another way that I find useful, and that does not require the (*SKIP)(*F) backtracking verbs or the \G and \K features, is to use the alternation operator in context: place what you want to keep in a capturing group on the left side and what you want to exclude on the right side (saying throw this away, it's garbage...).

gsub('^(\\D*\\d+)|\\d+', '\\1', x)

Not to pick on you, but your second way won't work if the subject string begins with digits (http://regex101.com/r/yW4aZ3/140). Because the right side can swallow the left side, you should reverse their order to `^(\D*\d+)|\d+` and replace with `\1` (http://regex101.com/r/yW4aZ3/141), a variation on Avinash's solution, or use `^\D*\d+\K|\d+`. – alpha bravo, Nov 30 '14 at 16:54
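The ordering issue raised in this comment can be checked with a short sketch. The inputs other than x are made up for illustration, perl=TRUE is used so that the alternation is tried strictly left to right, and the reversed pattern is only an assumed example of the wrong order, not necessarily the one that was originally posted:

```r
x <- 'foo123bar123baz123456abc1111def123456789'
# Hypothetical extra inputs: one starts with digits, one has no digits.
inputs <- c(x, '123foo456bar789', 'nodigits')

# Keep-group on the left, "garbage" on the right.
good <- gsub('^(\\D*\\d+)|\\d+', '\\1', inputs, perl = TRUE)

# Assumed wrong order: the bare \d+ wins at position 0 when the string
# starts with digits, so group 1 is never set and the first run is lost.
bad <- gsub('\\d+|^(\\D*\\d+)', '\\1', '123foo456bar789', perl = TRUE)

print(good)
print(bad)
```

With the keep-group first, the leading digits of '123foo456bar789' survive; with the assumed reversed order they are removed along with the later runs.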

score 3 · Answer 2 · edited May 23 '17 at 10:26

You could do this with the PCRE verbs (*SKIP)(*F).

^\D*\d+(*SKIP)(*F)|\d+

^\D*\d+ matches all the characters from the start up to and including the first number. (*SKIP)(*F) then causes that match to fail, and (*SKIP) prevents the engine from retrying inside the text already consumed. The regex engine then tries the pattern on the right side of |, that is \d+, against the remaining string. Because (*SKIP)(*F) are PCRE verbs, you need to set the perl=TRUE parameter.

Code:

> x <- 'foo123bar123baz123456abc1111def123456789'
> gsub("^\\D*\\d+(*SKIP)(*F)|\\d+", "", x, perl=TRUE)
[1] "foo123barbazabcdef"
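For reuse, the call above could be wrapped in a small helper. This is only a sketch: the function name keep_first_digits and the extra test strings are assumptions, not part of the answer.

```r
# Hypothetical helper: keep the first run of digits, drop all later runs.
keep_first_digits <- function(s) {
  # (*SKIP)(*F) is PCRE-only, hence perl = TRUE.
  gsub('^\\D*\\d+(*SKIP)(*F)|\\d+', '', s, perl = TRUE)
}

# The question's string, plus two made-up edge cases:
# a string that starts with digits and a string with no digits at all.
out <- keep_first_digits(c('foo123bar123baz123456abc1111def123456789',
                           '123foo456',
                           'nodigits'))
print(out)
```

Because gsub is vectorized over its input, the helper accepts a character vector as well as a single string.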
