
I am matching numbers in a string using a regex in Python. I want to capture numbers that may have a thousands separator (for me, a comma or a space) or may be just a plain run of digits. The following shows what my regex captures:

>>> import re
>>> test = '3,254,236,948,348.884423 cold things, ' + \
'123,242 falling birds, .84973 of a French pen , ' + \
'65 243 turtle gloves, 8 001 457.2328009 units, and ' + \
'8d523c.'
>>> matches = re.finditer(ANY_NUMBER_SRCH, test, flags=re.MULTILINE)
>>> for match in matches:
...   print (str(match))
...
<_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
<_sre.SRE_Match object; span=(27, 34), match='123,242'>
<_sre.SRE_Match object; span=(37, 43), match='.84973'>
<_sre.SRE_Match object; span=(46, 52), match='65 243'>
<_sre.SRE_Match object; span=(55, 72), match='8 001 457.2328009'>
<_sre.SRE_Match object; span=(73, 74), match='8'>
<_sre.SRE_Match object; span=(75, 78), match='523'>

This is the matching behavior I want. Now, I want to take each of the matched numbers and remove the thousands separators (',' or ' ') if they exist. This should leave me with

'3254236948348.884423 cold things, ' + \
'123242 falling birds, .84973 of a French pen , ' + \
'65243 turtle gloves, 8001457.2328009 units, ' + \
'and 8d523c.'

Basically, I have one regex to capture the number. This regex is used in multiple places, e.g. to find dollar amounts, to get ordinal numbers, ... For this reason, I've named the regex, ANY_NUMBER_SRCH.

What I want to do is something like the following:

matches = some_method_to_get_all_matches(ANY_NUMBER_SRCH)
for match in matches:
  corrected_match = re.sub(r"[, ]", "", match)
  change_match_to_corrected_match_in_the_test_string

As things are, I cannot use substitution groups. If you just want to see the regex, you can check out https://regex101.com/r/AzChEE/3. Basically, part of my regex is as follows:

r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})(?P<thousands_separator>[ ,])(?P<three_digits_w_sep>(?P<three_digits>\d{3})(?P=thousands_separator))*(?P<last_group_of_three>\d{3})(?!\d)"

I'll represent that without the "scrolling line":

(r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})"
  "(?P<thousands_separator>[ ,])"
  "(?P<three_digits_w_sep>(?P<three_digits>\d{3})"
  "(?P=thousands_separator))*"
  "(?P<last_group_of_three>\d{3})(?!\d)")

The regex engine doesn't keep every repetition of three_digits_w_sep: because of the *, a repeated capturing group retains only its last iteration.
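To illustrate the repeated-group behavior, here is a minimal sketch. The pattern is a simplified stand-in chosen for the demonstration, not ANY_NUMBER_SRCH, and the group names `first` and `rest` are hypothetical.

```python
import re

# Simplified stand-in pattern (an assumption, not ANY_NUMBER_SRCH):
# one leading group of 1-3 digits, then a repeated ",ddd" group.
pattern = re.compile(r"(?P<first>\d{1,3})(?P<rest>,\d{3})*")

m = pattern.match("3,254,236")

# The repeated group 'rest' retains only its final iteration,
# while group 0 still holds the entire matched text.
last_rest = m.group("rest")   # ',236'
whole = m.group(0)            # '3,254,236'
print(last_rest, whole)
```

The earlier iteration (`,254`) is not available through the group, which is why per-group substitution does not reach every separator.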

I'm sure there's a way to use the span parts of the _sre.SRE_Match objects. That would be quite involved, however, and I'm dealing with strings that are thousands to hundreds of thousands of characters long. Is there a simple way to do re.sub after re.match, re.finditer, or whichever other method is used to find the number pattern?

Update: @abarnert got me the right answer, which is to pass a lambda function to re.sub. My comment under @abarnert's answer, beginning with 'Verified!', shows all the steps. Just in case that comment goes the way of the broken links, here's an image of the comment, with all the solution steps.


My Attempts

By the way, I have looked at these questions on SO (replace portion of match, extract part of a match, replace after matching pattern, repeated capturing group stuff), but they simply show how to use substitution groups. I've also tried to use re.finditer as shown below with the following result.

>>> matches = re.finditer(lib_re.ANY_NUMBER_SRCH, test, flags=re.MULTILINE)     
>>> for match in matches:
...   print ("match: " + str(match))
...   corrected_match = re.sub(r"[, ]", "", match)
...   print ("corrected_match: " + str(corrected_match))
...
match: <_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
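For reference, the TypeError above arises because re.sub expects a string, while each item yielded by re.finditer is a match object; match.group() returns the matched text. Below is a minimal sketch of the same loop with that change, rebuilding the string from the match spans. Here `number_re` is a hypothetical, simplified stand-in for ANY_NUMBER_SRCH, and the shortened `test` string is only for illustration.

```python
import re

# Hypothetical simplified pattern standing in for ANY_NUMBER_SRCH:
# digit groups joined by a comma or space, or a plain run of digits.
number_re = re.compile(r"\d{1,3}(?:[ ,]\d{3})+(?!\d)|\d+")

test = "123,242 falling birds, 65 243 turtle gloves"

pieces = []
last_end = 0
for match in number_re.finditer(test):
    # Keep the text between the previous match and this one unchanged.
    pieces.append(test[last_end:match.start()])
    # match.group() is a str, so re.sub accepts it.
    pieces.append(re.sub(r"[, ]", "", match.group()))
    last_end = match.end()
pieces.append(test[last_end:])  # trailing text after the last match

result = "".join(pieces)
print(result)  # 123242 falling birds, 65243 turtle gloves
```

Collecting the pieces in a list and joining once avoids repeatedly copying a long string, which may matter for inputs of hundreds of thousands of characters.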

The Big Regex

In case something happens to the regex101.com link, here is the giant regex:

ANY_NUMBER_SRCH = r"(?P<number_capture>(?P<pre1>(?<![^0-9,.+-])|)(?P<number>(?P<sign_symbol_opt1>(?<![0-9])[+-])?(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})(?P<thousands_separator>[ ,])(?P<three_digits_w_sep>(?P<three_digits>\d{3})(?P=thousands_separator))*(?P<last_group_of_three>\d{3})(?!\d)|(?P<whole_number_w_o_thous_sep>\d+))(?P<decimal_separator_1>[.])?(?P<fractional_w_whole_before>(?<=[.])(?P<digits_after_decimal_sep_1>\d+))?(?P<post1>(?<![^0-9,.+-])|)|(?P<pre2>(?<![^0-9,.+-])|)(?P<fractional_without_whole_before>(?P<sign_symbol_opt2>(?<![0-9])[+-])?(?P<decimal_separator_2>[.])(?P<digits_after_decimal_sep_2>\d+)))(?P<post2>(?<![^0-9,.+-])|))"
bballdave025
  • Meanwhile, what did you expect `re.sub(r"[, ]", "", match)` to do? You can only call that on a string, not a match object. And, even if you fix that, once you have `corrected_match`, what are you going to do with it? Obviously nothing you do to create a new string is going to affect `test` in any way. – abarnert Aug 01 '18 at 19:06
  • I wasn't expecting the `re.sub` to work on `match`. I wasn't sure how to do the substitution there--that was my question, @abarnert. I wasn't sure how to get the stuff from the matches. Your answer with the lambda functions does exactly what I want. Thanks so much for answering a question that was imperfect. What you did does affect test, which is what I wanted. – bballdave025 Aug 02 '18 at 15:27
  • Well, it allows me to affect test. @abarnert, I've accepted your answer. I'm now trying to implement it to also replace the spaces in a number like, '8 001 457.2328009', which should go to '8001457.2328009'. – bballdave025 Aug 02 '18 at 15:39

1 Answer

I don't see any reason you can't just use re.sub instead of re.finditer here. Your repl gets applied once for each match, and the result of substituting each pattern with repl in string is returned, which is exactly what you want.

I can't actually run your example, because copying and pasting test gives me a SyntaxError, and copying and pasting ANY_NUMBER_SRCH gives me an error compiling the regex, and I don't want to go down a rabbit hole trying to fix all of your bugs, most of which probably aren't even in your real code. So let me give a simpler example:

>>> import re
>>> test = '3,254,236,948,348.884423 cold things and 8d523c'
>>> pattern = re.compile(r'[\d,]+')
>>> pattern.findall(test) # just to verify that it works
['3,254,236,948,348', '884423', '8', '523']
>>> pattern.sub(lambda match: match.group().replace(',', ''), test)
'3254236948348.884423 cold things and 8d523c'

Obviously your repl function will be a bit more complicated than just removing all of the commas, and you'll probably want to def it out-of-line rather than try to cram it into a lambda. But whatever your rule is, if you can write it as a function that takes a match object and returns the string you want in place of that match, you can just pass that function to sub.
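As a rough sketch of the out-of-line version, assuming the rule is simply "drop commas and spaces inside each matched number": `NUMBER_RE` below is a hypothetical, simplified stand-in for ANY_NUMBER_SRCH, and `strip_thousands_separators` is an illustrative name, not anything from the question.

```python
import re

# Hypothetical simplified stand-in for ANY_NUMBER_SRCH: digit groups
# joined by a comma or space, or a plain run of digits.
NUMBER_RE = re.compile(r"\d{1,3}(?:[ ,]\d{3})+(?!\d)|\d+")


def strip_thousands_separators(match):
    """Return the matched number text with ',' and ' ' removed."""
    return match.group().replace(",", "").replace(" ", "")


test = ("3,254,236,948,348.884423 cold things, "
        "8 001 457.2328009 units, and 8d523c.")

# sub calls the function once per match and splices in its return value.
cleaned = NUMBER_RE.sub(strip_thousands_separators, test)
print(cleaned)
# 3254236948348.884423 cold things, 8001457.2328009 units, and 8d523c.
```

Because sub only hands the function text that the pattern matched, the commas and spaces between words are left alone; swapping in the real pattern should only require changing `NUMBER_RE`.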

abarnert
  • I apologize for not double-checking my question. I tried to make things look good in the `test` string, but it made it not copy/paste-able. I've made that so it works on my machine, and hopefully on others as well. I had missed a '`P`' before '``' when I copy/pasted from regex101.com -- I added the named groups in an attempt to clarify things. Thanks for "reading through" these errors and giving me the answer I needed! – bballdave025 Aug 02 '18 at 15:42
  • Verified! `#ANY_NUMBER_SRCH as in question;` ; `#test as in question;` ; `>>>pattern=re.compile(ANY_NUMBER_SRCH);` ; `test_corrected=pattern.sub(lambda match: match.group().replace(',', '').replace(' ', ''), test);` ; `>>>test_corrected;` #result# `'3254236948348.884423 cold things, 123242 falling birds, .84973 of a French pen , 65243 turtle gloves, 8001457.2328009 units, and 8d523c.'` ; Just as required. – bballdave025 Aug 02 '18 at 15:54