Python - regex extract numbers from text that may contain thousands or millions separators and convert them to dot separated decimal floats

Question

I'm trying to extract 'valid' numbers from text that may or may not contain thousands or millions separators and decimals. The problem is that the separator is sometimes ',' and sometimes '.', and the same applies to the decimal mark. To detect automatically whether a character is a decimal or a thousands separator, I think I need to check whether a ',' or '.' occurs later in the number, in addition to the \d{3} condition.

Another problem is that the text contains dates in the format 'dd.mm.yyyy' or 'mm.dd.yy', which must not be matched.

The goal is to convert 'valid' numbers to float: I need to make sure a match is not a date, then remove the millions/thousands separators, and finally replace ',' with '.' when the decimal separator is ','.
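The three steps described above can be sketched for a single, already isolated token without any regex. This is only an illustration under stated assumptions: `normalize` and its helpers are hypothetical names, not code from this thread, and an ambiguous token such as '6.010' is assumed to use a thousands separator.

```python
def _digits(s):
    # True for a non-empty run of ASCII digits.
    return s.isascii() and s.isdigit()

def _grouped(groups):
    # 1-3 leading digits followed by groups of exactly three digits.
    return (_digits(groups[0]) and len(groups[0]) <= 3
            and all(_digits(g) and len(g) == 3 for g in groups[1:]))

def normalize(token):
    """Return token as a dot-decimal string, or None if it is not a plausible number."""
    seps = [c for c in token if c in ',.']
    if not seps:
        return token if _digits(token) else None
    if len(set(seps)) == 2:
        # Both separators occur: the last one is taken as the decimal mark
        # and must appear exactly once.
        decimal = seps[-1]
        if seps.count(decimal) != 1:
            return None
        thousands = ',' if decimal == '.' else '.'
        integer, fraction = token.rsplit(decimal, 1)
        groups = integer.split(thousands)
        if _grouped(groups) and _digits(fraction):
            return ''.join(groups) + '.' + fraction
        return None
    # Only one kind of separator occurs.
    groups = token.split(seps[0])
    if _grouped(groups):
        # Assumption: groups of three digits mean thousands separators.
        return ''.join(groups)
    if len(groups) == 2 and all(_digits(g) for g in groups):
        # A single separator that is not a thousands separator is the decimal mark.
        return groups[0] + '.' + groups[1]
    return None  # e.g. dates such as '05.12.2018' or '12.05.18'

for token in ['6.010,12', '6,010', '6010,124', '05.12.2018']:
    print(token, '=>', normalize(token))
```

The sketch does not locate the tokens inside free text; it only covers the "is it a date, which separator is which" decision.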

I have read other great answers, like "Regular expression to match numbers with or without commas and decimals in text" and another linked one, which solve more specific problems. I would be happy with something robust (it doesn't need to be a single regex).

Here's what I've tried so far but the problem is well above my regex skills:

import re

p = r'\d+(?:[,.]\d{3})*(?:[.,]\d*)'
for s in ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']:
    print(s, re.findall(p, s))

Does [this code](https://ideone.com/W1q8P8) do what you need? — Wiktor Stribiżew, Feb 21 '22 at 11:49
@WiktorStribiżew Thank you! yes it does! The only problem is that it matches dates with format dd.mm.yy or dd.mm.yyyy as well, can we skip them? — nopeva, Feb 21 '22 at 12:18
It matches but does nothing to them, isn't it what you want? Do nothing to those matches? Can you also confirm if `6010,124` must be replaced with `6010.124` or stay the same. — Wiktor Stribiżew, Feb 21 '22 at 12:19
Yes, ideally I would like to scan the text to find valid numbers only (without any text or dates) — nopeva, Feb 21 '22 at 12:22
My target is to extract them and convert them to float as you have already done — nopeva, Feb 21 '22 at 12:23

score 1 · Accepted Answer · answered Feb 21 '22 at 12:37

You can use

import re

p = r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'

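def postprocess(x):
    # Group 3 holds the fractional digits: strip the separators from the
    # integer part (Group 1) and join both parts with a dot.
    if x.group(3):
        return f"{x.group(1).replace(',','').replace('.','')}.{x.group(3)}"
    # Group 2 holds the thousands separator: there is no fractional part.
    elif x.group(2):
        return f"{x.group(1).replace(',','').replace('.','')}"
    # Neither group matched (for example, a date): skip this match.
    else:
        return None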

texts = ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']

for s in texts:
    print(s, '=>', list(filter(None, [postprocess(x) for x in re.finditer(p, s)])) )

Output:

blabla 1,25 10.587.256,25 euros => ['1.25', '10587256.25']
6.010,12 => ['6010.12']
6.010 => ['6010']
6,010 => ['6010']
6,010.12 => ['6010.12']
6010,124 => ['6010.124']
05.12.2018 => []
12.05.18 => []

The regex is

\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)

Details:

\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b| - matches a whole word, 1-2 digits, ., 1-2 digits, ., 2 or 4 digits (this match will be skipped)
\b - a word boundary
(?<!\d[.,]) - a negative lookbehind failing the match if there is a digit and a . or , immediately on the left
(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+) - Group 1:
- \d{1,3} - one, two or three digits
- (?=([.,])?) - a positive lookahead that optionally captures a . or , immediately on the right into Group 2 (without consuming it)
- (?:\2\d{3})* - zero or more sequences of Group 2 value and then any three digits
- | - or
- \d+ - one or more digits
(?:(?(2)(?!\2))[.,](\d+))? - an optional sequence of
- (?(2)(?!\2)) - if Group 2 matched, the next char cannot be Group 2 value
- [.,] - a comma or dot
- (\d+) - Group 3: one or more digits
\b - a word boundary
(?![,.]\d) - a negative lookahead failing the match if there is a , or . and a digit immediately on the right.
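To make the breakdown above concrete, the following sketch prints the whole match together with Group 1 and Group 3 for a few of the test strings. `show_groups` is a hypothetical helper added for illustration; the pattern is copied unchanged from the answer. In Python's `re`, a backreference to a group that did not participate in the match fails, which is why `(?:\2\d{3})*` matches zero times when no separator follows the first digits.

```python
import re

# Pattern copied unchanged from the answer.
p = r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'

def show_groups(text):
    """Return (whole match, Group 1, Group 3) for each match in text."""
    return [(m.group(0), m.group(1), m.group(3)) for m in re.finditer(p, text)]

for s in ['6.010,12', '6,010', '6010,124', '05.12.2018']:
    print(s, '=>', show_groups(s))
```

For a date, the first alternative matches, so both groups are None; that is the signal the post-processing step uses to discard it.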

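The postprocess function returns None when neither Group 2 nor Group 3 participated in the match, which is the case for the date alternative (and also for a plain integer with no separator at all). Otherwise, it returns the integer part with all commas and dots removed, followed by a dot and the Group 3 digits when a fractional part is present.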
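Since the stated goal is float values, here is a minimal sketch of a variant that returns floats instead of strings. It is not the code from the answer: `to_float` and `extract_floats` are hypothetical names, and it deliberately differs from `postprocess` in one respect, namely that it tests Group 1 rather than Group 2, so plain integers such as '100' are kept.

```python
import re

# Pattern copied unchanged from the answer.
PATTERN = re.compile(
    r'\b\d{1,2}\.\d{1,2}\.\d{2}(?:\d{2})?\b|\b(?<!\d[.,])(\d{1,3}(?=([.,])?)(?:\2\d{3})*|\d+)(?:(?(2)(?!\2))[.,](\d+))?\b(?![,.]\d)'
)

def to_float(match):
    # Group 1 is unset only when the date alternative matched.
    if match.group(1) is None:
        return None
    # Drop every separator from the integer part.
    integer_part = match.group(1).replace(',', '').replace('.', '')
    fraction = match.group(3)
    return float(f"{integer_part}.{fraction}" if fraction else integer_part)

def extract_floats(text):
    # Compare with None explicitly so that a legitimate 0.0 is not dropped.
    return [v for v in map(to_float, PATTERN.finditer(text)) if v is not None]

print(extract_floats('blabla 1,25 10.587.256,25 euros'))
print(extract_floats('05.12.2018'))
```

If plain integers should be skipped, as in the original `postprocess`, the Group 1 test can be replaced by a check that Group 2 or Group 3 is set.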
