3

I'm trying to replace commas for cases like:

123,123  

where the output should be:

123123

for that I tried this:

re.sub('\d,\d','','123,123')

but that is also deleting the the digits, how can avoid this?

I only want to remode the comma for that case in particular, that's way I'm using regex. For this case, e.g.

'123,123  hello,word'

The desired output is:

'123123  hello,word'
Luis Ramon Ramirez Rodriguez
  • 9,591
  • 27
  • 102
  • 181
  • See also the answers to [convert decimal mark](https://stackoverflow.com/q/7106417/2314737), a typical use case for such substitution. – user2314737 Dec 01 '17 at 10:30

2 Answers2

12

You can use regex look around to restrict the comma (?<=\d),(?=\d); use ?<= for look behind and ?= for look ahead; They are zero length assertions and don't consume characters so the pattern in the look around will not be removed:

import re

re.sub('(?<=\d),(?=\d)', '', '123,123  hello,word')
# '123123  hello,word'
Psidom
  • 209,562
  • 33
  • 339
  • 356
2

This is one of the cases where you want regular expression "lookaround assertions" ... which have zero length (pattern capture semantics).

Doing so allows you to match cases which would otherwise be "overlapping" in your substitution.

Here's an example:

#!python
import re
num = '123,456,7,8.012,345,6,7,8'
pattern = re.compile(r'(?<=\d),(?=\d)')
pattern.sub('',num)
# >>> '12345678.012345678'

... note that I'm using re.compile() to make this more readable and also because that usage pattern is likely to perform better in many cases. I'm using the same regular expression as @Psidom; but I'm using a Python 'raw' string which is more commonly the way to express regular expressions in Python.

I'm deliberately using an example where the spacing of the commas would overlap if I were using a regular expression such as; re.compile(r'(\d),(\d)') and trying to substitute using back references to the captured characters pattern.sub(r'\1\2', num) ... that would work for many examples; but '1,2,3' would not match because the capturing causes them to be overlapping.

This one of the main reasons that these "lookaround" (lookahead and lookbehind) assertions exist ... to avoid cases where you'd have to repeatedly/recursively apply a pattern due to capture and overlap semantics. These assertions don't capture, they match "zero" characters (as with some PCRE meta patterns like \b ... which matches the zero length boundary between words rather than any of the whitespace (\s which or non-"word" (\W) characters which separate words).

Jim Dennis
  • 17,054
  • 13
  • 68
  • 116
  • By the way, if you're actually trying to handle strings containing localized number representations (https://en.wikipedia.org/wiki/Decimal_mark) then you might want to see this [StackOverflow: How do I use Python to convert a string to a number if it has commas in it as thousands separators?](https://stackoverflow.com/a/1779324/149076). – Jim Dennis Sep 05 '17 at 03:57