
I'm trying to come up with a regex to search for dollar values in Python. I have looked at and tried many solutions from SO posts, but none of them quite works.

The regex I came up with is:

[Ss]        # OCR will mess up with dollar signs, so I'm specifically looking for S and s as the starting of what I'm looking for
\d+         # any digits to start off
(,\d{3})*   # include comma for thousand splits, can have multiple commas
(.\d{2})?   # include dot and 2 decimals, but only one occurrence of this part

I have tried this on the following example:

t = "sixteen thousand three hundred and thirty dollars (s16,330.00)"
r = "[Ss]\d+(,\d{3})*(.\d{2})?"

re.findall(pattern=r, string=t)

And I got:

[(',330', '.00')]

Regex doc says that:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

But it is not even getting the whole number part.

My question is: I really want to find s16,330.00 as a single piece. Is there a solution?

TYZ

3 Answers


Remove capture groups to allow findall to return full matched string:

>>> import re
>>> t = "sixteen thousand three hundred and thirty dollars (s16,330.00)"
>>> r = r"[Ss]\d+(?:,\d{3})*(?:\.\d{2})?"
>>> re.findall(pattern=r, string=t)
['s16,330.00']

Also note that the dot needs to be escaped in your regex (`\.`); an unescaped `.` matches any character, not just a literal decimal point.
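
To see why the escape matters, here is a minimal sketch comparing the two patterns. The sample string `total s100x25 due` is made up for illustration (imagine OCR inserting a stray character); it is not from the question.

```python
import re

# Hypothetical OCR output where a stray character follows the amount.
sample = "total s100x25 due"

unescaped = r"[Ss]\d+(?:,\d{3})*(?:.\d{2})?"   # "." matches any character
escaped = r"[Ss]\d+(?:,\d{3})*(?:\.\d{2})?"    # "\." matches only a literal dot

loose = re.findall(unescaped, sample)
strict = re.findall(escaped, sample)

print(loose)   # ['s100x25']
print(strict)  # ['s100']
```

With the unescaped dot, the optional tail swallows `x25` as if it were a decimal part; with the escaped dot, the match stops at the integer amount.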

anubhava
  • I think the key was I was missing the "?:", is that for disabling subpattern matching? – TYZ Nov 27 '18 at 19:35
  • `(?:..)` is called non-capture group. Read more about it: https://www.regular-expressions.info/refcapture.html – anubhava Nov 27 '18 at 19:37

Use finditer:

import re

t = "sixteen thousand three hundred and thirty dollars (s16,330.00)"
r = r"[Ss]\d+(,\d{3})*(\.\d{2})?"  # raw string; dot escaped so it matches a literal "."

result = [match.group() for match in re.finditer(pattern=r, string=t)]
print(result)

Output

['s16,330.00']

The function finditer returns an iterator yielding match objects. The method group of a match object without arguments returns the whole match.
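
As a rough sketch of what a match object exposes, the snippet below collects the whole match, the capture groups, and the span for each hit. The variable names (`pattern`, `results`) are chosen here for illustration, and the pattern uses an escaped dot.

```python
import re

t = "sixteen thousand three hundred and thirty dollars (s16,330.00)"
pattern = re.compile(r"[Ss]\d+(,\d{3})*(\.\d{2})?")

# For every match, keep the full text, the capture groups, and the position.
results = [(m.group(), m.groups(), m.span()) for m in pattern.finditer(t)]

for whole, groups, (start, end) in results:
    print(whole)           # s16,330.00  (same as m.group(0))
    print(groups)          # (',330', '.00')  (what findall would have returned)
    print(t[start:end])    # slicing by the span gives the whole match again
```

So the capture groups can stay in the pattern; `group()` simply ignores them and returns the full match, while `groups()` still gives access to the pieces.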

Dani Mesejo

Use a capturing group for the whole pattern, and non-capturing for subpatterns:

>>> import re
>>> t = "sixteen thousand three hundred and thirty dollars (s16,330.00)"
>>> re.findall(r"([Ss]\d+(?:,\d{3})*(?:\.\d{2})?)", t)
['s16,330.00']

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

https://docs.python.org/2/library/re.html#re.findall
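
The three cases in that description can be sketched side by side. The regrouped patterns below are illustrative variations on the question's regex, not something the asker wrote.

```python
import re

t = "sixteen thousand three hundred and thirty dollars (s16,330.00)"

# No capture groups: findall returns the full matched strings.
no_groups = re.findall(r"[Ss]\d+(?:,\d{3})*(?:\.\d{2})?", t)

# Exactly one capture group: findall returns only that group's text.
one_group = re.findall(r"[Ss](\d+(?:,\d{3})*)(?:\.\d{2})?", t)

# Two capture groups: findall returns a tuple per match.
two_groups = re.findall(r"[Ss](\d+(?:,\d{3})*)(\.\d{2})?", t)

print(no_groups)   # ['s16,330.00']
print(one_group)   # ['16,330']
print(two_groups)  # [('16,330', '.00')]
```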

mrzasa