I am matching numbers in a string using a regex in Python. I want to capture numbers that may have a thousands separator (for me, a comma or a space) or may just be a plain run of digits. The following shows what my regex captures:
>>> import re
>>> test = '3,254,236,948,348.884423 cold things, ' + \
...     '123,242 falling birds, .84973 of a French pen , ' + \
...     '65 243 turtle gloves, 8 001 457.2328009 units, and ' + \
...     '8d523c.'
>>> matches = re.finditer(ANY_NUMBER_SRCH, test, flags=re.MULTILINE)
>>> for match in matches:
...     print(str(match))
...
<_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
<_sre.SRE_Match object; span=(27, 34), match='123,242'>
<_sre.SRE_Match object; span=(37, 43), match='.84973'>
<_sre.SRE_Match object; span=(46, 52), match='65 243'>
<_sre.SRE_Match object; span=(55, 72), match='8 001 457.2328009'>
<_sre.SRE_Match object; span=(73, 74), match='8'>
<_sre.SRE_Match object; span=(75, 78), match='523'>
This is the matching behavior I want. Now, I want to take each of the matched numbers and remove the thousands separators (',' or ' ') if they exist. This should leave me with
'3254236948348.884423 cold things, ' + \
'123242 falling birds, .84973 of a French pen ,' + \
'65243 turtle gloves, 8001457.2328009 units, ' + \
'and 8d523c.'
Basically, I have one regex to capture the number. This regex is used in multiple places, e.g. to find dollar amounts, to get ordinal numbers, and so on. For this reason, I've named the regex ANY_NUMBER_SRCH.
What I want to do is something like the following:
matches = some_method_to_get_all_matches(ANY_NUMBER_SRCH)
for match in matches:
    corrected_match = re.sub(r"[, ]", "", match)
    change_match_to_corrected_match_in_the_test_string
As things are, I cannot use substitution groups. If you just want to see the regex, you can check out https://regex101.com/r/AzChEE/3. Basically, part of my regex is as follows:
r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})(?P<thousands_separator>[ ,])(?P<three_digits_w_sep>(?P<three_digits>\d{3})(?P=thousands_separator))*(?P<last_group_of_three>\d{3})(?!\d)"
I'll represent that without the "scrolling line":
(r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})"
"(?P<thousands_separator>[ ,])"
"(?P<three_digits_w_sep>(?P<three_digits>\d{3})"
"(?P=thousands_separator))*"
"(?P<last_group_of_three>\d{3})(?!\d)")
The regex engine doesn't keep every repetition of three_digits_w_sep: because of the *, a repeated capturing group retains only its last iteration.
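To illustrate the point about repeated capturing groups, here is a minimal sketch. The pattern and the group names `first`, `rep`, and `three` are hypothetical, cut-down stand-ins for the named groups above, not the real regex.

```python
import re

# Hypothetical, simplified pattern: one leading group, then a repeated group.
pattern = re.compile(r"(?P<first>\d{1,3})(?P<rep>,(?P<three>\d{3}))*")

m = pattern.match("3,254,236,948")

# The repeated group matched three times, but only the last iteration is kept.
last_three = m.group("three")

# The whole match is still available, so separators can be stripped from it.
whole = m.group(0)
without_separators = whole.replace(",", "")
```

Here `last_three` holds only the final `948`, while `whole` still contains every digit group, which is why working on the whole match is more practical than on the individual groups.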
I'm sure there's a way to use the span parts of the _sre.SRE_Match objects. That would be quite involved, however, and I'm dealing with strings of thousands to hundreds of thousands of characters. Is there a simple way to do re.sub after the re.match or re.finditer or whichever other method is used to find the number pattern?
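For comparison, a span-based approach might look roughly like the sketch below. It assumes a simplified pattern, `NUMBER`, as a hypothetical stand-in for ANY_NUMBER_SRCH, and the helper name `strip_separators_by_span` is made up for illustration. It walks the matches once and joins the pieces at the end, so it should stay linear in the length of the string.

```python
import re

# Hypothetical simplified stand-in for ANY_NUMBER_SRCH (not the real regex).
NUMBER = re.compile(
    r"\d{1,3}(?P<sep>[ ,])\d{3}(?:(?P=sep)\d{3})*(?!\d)(?:\.\d+)?"  # with separators
    r"|\d+(?:\.\d+)?"                                                # plain digits
    r"|\.\d+"                                                        # fraction only
)

def strip_separators_by_span(text):
    """Rebuild text, removing ',' and ' ' inside each matched number."""
    pieces = []
    last_end = 0
    for m in NUMBER.finditer(text):
        start, end = m.span()
        pieces.append(text[last_end:start])                     # untouched text
        pieces.append(re.sub(r"[ ,]", "", m.group(0)))          # cleaned number
        last_end = end
    pieces.append(text[last_end:])                              # trailing text
    return "".join(pieces)

test = ('3,254,236,948,348.884423 cold things, '
        '123,242 falling birds, .84973 of a French pen , '
        '65 243 turtle gloves, 8 001 457.2328009 units, and '
        '8d523c.')
result = strip_separators_by_span(test)
```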
@abarnert got me the right answer: using a lambda function. My comment under @abarnert's answer, beginning with 'Verified!', shows all the steps.
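A lambda-based approach could look roughly like the following. This is only a sketch, not necessarily the exact code from @abarnert's answer, and `NUMBER` is a hypothetical simplified stand-in for ANY_NUMBER_SRCH. It relies on the fact that re.sub accepts a function as the replacement argument and calls it with each match object.

```python
import re

# Hypothetical simplified stand-in for ANY_NUMBER_SRCH (not the real regex).
NUMBER = re.compile(
    r"\d{1,3}(?P<sep>[ ,])\d{3}(?:(?P=sep)\d{3})*(?!\d)(?:\.\d+)?"  # with separators
    r"|\d+(?:\.\d+)?"                                                # plain digits
    r"|\.\d+"                                                        # fraction only
)

test = ('3,254,236,948,348.884423 cold things, '
        '123,242 falling birds, .84973 of a French pen , '
        '65 243 turtle gloves, 8 001 457.2328009 units, and '
        '8d523c.')

# The lambda receives each match object and returns the matched text
# with the thousands separators removed.
cleaned = NUMBER.sub(lambda m: re.sub(r"[ ,]", "", m.group(0)), test)
```

Because the replacement function only sees the text of each match, commas and spaces outside the numbers are left alone.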
My Attempts
By the way, I have looked at these questions on SO (replace portion of match, extract part of a match, replace after matching pattern, repeated capturing group stuff), but they simply show how to use substitution groups. I've also tried to use re.finditer as shown below with the following result.
>>> matches = re.finditer(lib_re.ANY_NUMBER_SRCH, test, flags=re.MULTILINE)
>>> for match in matches:
...     print ("match: " + str(match))
...     corrected_match = re.sub(r"[, ]", "", match)
...     print ("corrected_match: " + str(corrected_match))
...
match: <_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/usr/lib/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
>>> print ("corrected_match: " + str(corrected_match))
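The TypeError above arises because re.sub expects a string (or bytes-like object) as its third argument, while `match` is a match object. A minimal sketch of the fix, using a hypothetical cut-down pattern rather than ANY_NUMBER_SRCH, is to pass the matched text via `match.group(0)`:

```python
import re

# Hypothetical, simplified pattern just for this demonstration.
match = re.search(r"\d{1,3}(?:,\d{3})+", "123,242 falling birds")

# Pass the matched string, not the match object itself, to re.sub.
corrected_match = re.sub(r"[, ]", "", match.group(0))
```

This yields the cleaned number for one match; it does not by itself write the result back into the original string.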
The Big Regex
In case something happens to the regex101.com link, here is the giant regex:
ANY_NUMBER_SRCH = r"(?P<number_capture>(?P<pre1>(?<![^0-9,.+-])|)(?P<number>(?P<sign_symbol_opt1>(?<![0-9])[+-])?(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})(?P<thousands_separator>[ ,])(?P<three_digits_w_sep>(?P<three_digits>\d{3})(?P=thousands_separator))*(?P<last_group_of_three>\d{3})(?!\d)|(?P<whole_number_w_o_thous_sep>\d+))(?P<decimal_separator_1>[.])?(?P<fractional_w_whole_before>(?<=[.])(?P<digits_after_decimal_sep_1>\d+))?(?P<post1>(?<![^0-9,.+-])|)|(?P<pre2>(?<![^0-9,.+-])|)(?P<fractional_without_whole_before>(?P<sign_symbol_opt2>(?<![0-9])[+-])?(?P<decimal_separator_2>[.])(?P<digits_after_decimal_sep_2>\d+)))(?P<post2>(?<![^0-9,.+-])|))"