I need help finding a regex that searches a large string/text and matches numbers in the format 12,345,678, 1,234,567, 12,345, or 1,234.

For example, for 12,345,678 it should match 12,345,678 and not 345,678, 45,678, or anything similar.

I looked at "Regex help needed to match numbers", but the answers there either match 1 in 1,23,456 (which should not match at all, because 1,23,456 is not a valid number) or match 23,456 in 12,23,456 (which also should not match at all).

While building the regex, I first tried writing a rule for what it should not match (i.e., not 1,23,456), then a rule for what it should match. The last rule I created works in most cases, but not in all.

import re

number_regex1 = re.compile(r'''     # should not, but matches 12,233,57 = 12,233
                          ((\d\d(?=[\s.,]\d\d\d))((?<=\d\d)[\s.,]\d\d\d)([\s.,]\d\d\d)+)| # matches 12,345,678
                          ((\d(?=[\s.,]\d\d\d))((?<=\d)([\s.,]\d\d\d))([\s.,]\d\d\d)+)| # matches 1,234,567
                          (((?<!\d[\s.,])(?<!\d)(?<!\d\d[\s.,])(?<!\d\d\d[\s.,])\d\d(?=[\s.,]\d\d\d))((?<=\d\d)[\s.,]\d\d\d))| # matches 12,345
                          (((?<!\d[\s.,])(?<!\d)(?<!\d\d[\s.,])(?<!\d\d\d[\s.,])\d(?![\s.,]\d\d)(?=[\s.,]\d\d\d))((?<=\d)[\s.,]\d\d\d))| # matches 1,234''', re.VERBOSE)

I want the following call to match nothing:

mo = number_regex1.search('12,345,67')

because 12,345,67 is not a valid number.

1 Answer

You should use

re.findall(r'(?<!\d,)(?<!\d)\d{1,3}(?:,\d{3})*(?!,?\d)', text)
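As a quick check of how this behaves, here is a minimal sketch that compiles the pattern and runs it over some text. The name `NUMBER_RE` and the sample string are hypothetical, chosen only for illustration.

```python
import re

# Pattern copied from the answer above; NUMBER_RE is an illustrative name.
NUMBER_RE = re.compile(r'(?<!\d,)(?<!\d)\d{1,3}(?:,\d{3})*(?!,?\d)')

# Hypothetical sample text mixing valid and invalid groupings.
text = 'Totals: 12,345,678 and 1,234 but not 12,345,67 or 1,23,456.'

matches = NUMBER_RE.findall(text)
print(matches)  # ['12,345,678', '1,234']

# The case from the question: nothing should match.
print(NUMBER_RE.search('12,345,67'))  # None
```

Note that, as written, the pattern also accepts plain one- to three-digit numbers such as 42, because the comma group may repeat zero times.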

Details

  • (?<!\d,) - a digit followed by a comma is not allowed immediately to the left of the current location
  • (?<!\d) - a digit is not allowed immediately to the left of the current location
  • \d{1,3} - 1 to 3 digits
  • (?:,\d{3})* - 0 or more repetitions of a comma followed by 3 digits
  • (?!,?\d) - an optional comma followed by a digit is not allowed immediately to the right of the current location
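The breakdown above can also be kept next to the pattern itself by using `re.VERBOSE`, as the question's attempt did. This is only a sketch of one way to lay it out; the name `ANNOTATED_RE` is hypothetical.

```python
import re

# Same pattern as in the answer, spread over several lines with comments.
# In verbose mode, unescaped whitespace and "#" comments are ignored.
ANNOTATED_RE = re.compile(r'''
    (?<!\d,)       # not preceded by a digit followed by a comma
    (?<!\d)        # not preceded by a digit
    \d{1,3}        # leading group of 1 to 3 digits
    (?:,\d{3})*    # zero or more groups of a comma and exactly 3 digits
    (?!,?\d)       # not followed by an optional comma and a digit
''', re.VERBOSE)

print(ANNOTATED_RE.findall('12,345'))     # ['12,345']
print(ANNOTATED_RE.findall('12,23,456'))  # []
```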

Note that the two separate lookbehinds, (?<!\d,)(?<!\d), are required because Python's re module only supports fixed-width lookbehinds, so (?<!\d,|\d) won't work.
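The fixed-width restriction is easy to confirm. The sketch below assumes the standard-library `re` module (third-party regex engines may behave differently); the helper `compiles` is a hypothetical name introduced here.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# One lookbehind with alternatives of different widths is rejected.
print(compiles(r'(?<!\d,|\d)\d+'))      # False

# Two separate fixed-width lookbehinds are accepted.
print(compiles(r'(?<!\d,)(?<!\d)\d+'))  # True
```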

Wiktor Stribiżew
  • Nice. But I do not understand why the `?` after the `,` in the final lookahead `(?!,?\d)` is so important (it avoids matching bad formats). – electricsheep Sep 03 '19 at 17:45
  • @puskini33 The `(?!,?\d)` acts as a number boundary: it prevents a match if the number is immediately followed by a `,` and a digit, or by just a digit. `(?!,?\d)` is the same as `(?!,\d)(?!\d)`. – Wiktor Stribiżew Sep 03 '19 at 17:47
  • When searching through texts it is indeed handy not to be too strict with your commas – Wilco Waaijer Sep 03 '19 at 17:50
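To illustrate the point about the trailing lookahead discussed in the comments, here is a small sketch comparing three variants of it. The names `PREFIX`, `strict`, `split` and `no_optional` are hypothetical and exist only for this comparison.

```python
import re

# Everything in the answer's pattern except the trailing lookahead.
PREFIX = r'(?<!\d,)(?<!\d)\d{1,3}(?:,\d{3})*'

strict = re.compile(PREFIX + r'(?!,?\d)')       # lookahead from the answer
split = re.compile(PREFIX + r'(?!,\d)(?!\d)')   # equivalent form from the comment
no_optional = re.compile(PREFIX + r'(?!,\d)')   # variant without the "?"

for sample in ['12,345,67', '1234', '12,345']:
    print(sample,
          strict.findall(sample),
          split.findall(sample),
          no_optional.findall(sample))
```

Without the `?`, the lookahead only rejects a comma followed by a digit, so the engine can backtrack to a shorter match that is followed directly by a digit: the variant returns `['1']` for 12,345,67 and `['123']` for 1234, while the other two return nothing for both.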