Some of my numbers contain errors and look like "59.34343.23". I know the first dot is correct, but the second one (and any after it) should be removed. How can I remove them?

I tried using gsub in R:

gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)

or

gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)

However, both give the error "invalid regular expression", even though the same patterns work in a regex tester. What is my mistake here?

user2246905
  • If you're feeling a bit lazy you could reverse the string, convert each period that is followed by a period later in the string to an empty string, and then reverse the resulting string, using `\.(?=.*\.)` for step two, `(?=.*\.)` being a *positive lookahead*. Another lazy option is to split the string on periods to produce an array of `n` strings, replace the first two with them joined together with a period between them, then join the resulting `n-1` strings. – Cary Swoveland Oct 07 '21 at 18:21
  • I presume that the first character cannot be a period, but you may wish to clarify that as it affects the solutions. – Cary Swoveland Oct 07 '21 at 18:41
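The split-and-rejoin idea from the comment above could be sketched in R roughly as follows. `keep_first_dot` is a hypothetical helper name, not something from the thread, and it handles a single string rather than a vector:

```r
# Hypothetical helper: keep the first period, drop all later ones.
keep_first_dot <- function(x) {
  # Split on literal periods (fixed = TRUE avoids regex interpretation).
  parts <- strsplit(x, ".", fixed = TRUE)[[1]]
  # Zero or one period (or an empty string): nothing to remove.
  if (length(parts) < 2) return(x)
  # Re-join: first piece, one period, then the remaining pieces glued together.
  paste0(parts[1], ".", paste(parts[-1], collapse = ""))
}

keep_first_dot("59.34343.23")
# [1] "59.3434323"
```

For a whole column, something like `vapply(x, keep_first_dot, character(1))` should work, though a single vectorized `gsub` call is likely simpler.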

5 Answers

You can use

gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)

See the R demo online and the regex demo.

Details:

  • ^([^.]*\.) - Capturing group 1 (referred to as \1 in the replacement pattern): zero or more chars other than . from the start of string, and then a . char (the first in the string)
  • | - or
  • \. - any other dot in the string.

Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
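As a quick check of the behaviour described above, here is the same pattern applied to a small vector. The sample values in `x` are made up for illustration, and `perl = TRUE` is used here only to pin down which regex engine runs:

```r
# Made-up sample values: several periods, no period, and exactly one period.
x <- c("59.34343.23", "1.2.3.4", "42", "7.5")

# Group 1 is set only for the first match in each string, so the text up to
# and including the first period is written back; later periods become "".
fixed <- gsub("^([^.]*\\.)|\\.", "\\1", x, perl = TRUE)
fixed
# [1] "59.3434323" "1.234"      "42"         "7.5"
```

Strings with no period or a single period come back unchanged, so the call should be safe to run over a whole column.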

Wiktor Stribiżew

We may use

gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"

data

str1 <-  "59.34343.23"
akrun

By specifying perl = TRUE you can convert matches of the following regular expression to empty strings:

^[^.]*\.[^.]*\K.|\.

Start your engine!

If you are unfamiliar with \K hover over it in the regular expression at the link to see an explanation of its effect.
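In R the call might look like the sketch below. One caveat, as far as I can tell: the pattern assumes the string has at least two periods. On a value with exactly one period, the `\.` alternative (or backtracking into the `.` after `\K`) would still remove a character. The `grepl` guard and the sample vector are my additions, not part of the answer:

```r
pattern <- "^[^.]*\\.[^.]*\\K.|\\."

# Made-up sample values; "7.5" has a single period and must stay untouched.
x <- c("59.34343.23", "1.2.3.4", "7.5")

# Assumption: only touch elements that contain two or more periods.
has_extra <- grepl("\\..*\\.", x)
x[has_extra] <- gsub(pattern, "", x[has_extra], perl = TRUE)
x
# [1] "59.3434323" "1.234"      "7.5"
```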

Cary Swoveland

There is always the option to write back the dot only if it's the first in the line.
The key is to consume the other dots but not write them back.
The effect is to delete every dot after the first.

Below uses a branch reset to accomplish the goal (Perl mode).

(?m)(?|(^[^.\n]*\.)|()\.+)

Replace with $1 (written "\\1" in an R replacement string)

https://regex101.com/r/cHcu4j/1

 (?m)
 (?|
    ( ^ [^.\n]* \. )              # (1)
  | ( )                           # (1)
    \.+ 
 )
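A sketch of how this might be called from R, assuming `gsub` with `perl = TRUE` (PCRE supports branch reset groups) and `"\\1"` as the replacement. The sample vector is made up:

```r
# Made-up sample values, including runs of consecutive periods.
x <- c("59.34343.23", "1..2...3", "42")

# Both branches number their group as 1: the first branch captures the text
# up to and including the first period, the second captures an empty string
# and then consumes any later run of periods.
cleaned <- gsub("(?m)(?|(^[^.\\n]*\\.)|()\\.+)", "\\1", x, perl = TRUE)
cleaned
# [1] "59.3434323" "1.23"       "42"
```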
sln
  • Very nice! I need to go back to school. I've never heard of a *branch reset group*. – Cary Swoveland Oct 07 '21 at 20:46
  • @CarySwoveland - It's just for demonstration purposes. Many other ways to do the same thing. The branch reset wasn't being done so I threw it up .. – sln Oct 08 '21 at 23:46
  • As in [regurgitated](https://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-qualifiers?rq=1), though your formalism is clearly different. I'm just trying to sweat my way through regex kindergarten, but honing in on who to follow certainly helps. – Chris Oct 08 '21 at 23:59
  • @Chris - There are many ways to do the same thing. But don't despair, regex is predictably easy once the scope of the problem is understood. Sometimes there are too many options. Once you hit that stage it becomes about performance, where aesthetics should be a final option. If you can go past that, text processing really opens up. – sln Oct 14 '21 at 01:49

The patterns that you tried fail to compile because the lookbehind, e.g. (?<=\\..*), contains an unbounded quantifier (.*), which PCRE (the engine used with perl=TRUE) does not support inside a lookbehind assertion.
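To see the failure without stopping a script, the two patterns from the question can be wrapped in `tryCatch`. This is only a sketch: `attempt` is a hypothetical helper, and the returned marker string is my own choice rather than R's full error message:

```r
# Hypothetical helper: run gsub and report whether the pattern compiled.
attempt <- function(pattern) {
  tryCatch(
    # suppressWarnings hides the accompanying PCRE compilation warning.
    suppressWarnings(gsub(pattern, "", "59.34343.23", perl = TRUE)),
    error = function(e) "invalid regular expression"
  )
}

attempt("(?<=\\..*)\\.")     # unbounded .* inside a lookbehind
attempt("(?<!^[^.]*)\\.")    # unbounded [^.]* inside a lookbehind
```

Both calls should return the marker string, since neither lookbehind has a bounded length.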

Another variation using \G to get continuous matches after the first dot:

(?:^[^.]*\.|\G(?!^))[^.]*\K\.

In parts, the pattern matches:

  • (?: Non-capture group for the alternation |
    • ^[^.]*\. Start of string, match zero or more chars other than ., then match .
    • | Or
    • \G(?!^) Assert the position at the end of the previous match (not at the start of the string)
  • )[^.]* Close the group, then match zero or more chars other than .
  • \K\. Clear the match buffer and match the dot (to be removed)

Regex demo | R demo

gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)

Output

[1] "59.3434323"
The fourth bird