I'm trying to extract 'valid' numbers from text that may or may not contain thousands or millions separators and decimals. The problem is that the separator is sometimes ',' and sometimes '.', and the same applies to the decimal mark. To detect automatically whether a given character is a decimal or a thousands separator, I think I need to check whether a ',' or '.' occurs later in the number, in addition to the \d{3} condition.
Another problem I have found is that the text contains dates in the format 'dd.mm.yyyy' or 'mm.dd.yy', and these must not be matched.
The goal is to convert the 'valid' numbers to float: I need to make sure the match is not a date, then remove the millions/thousands separators, and finally replace ',' with '.' when the decimal separator is ','.
I have read other great answers, such as "Regular expression to match numbers with or without commas and decimals in text", but they solve more specific problems. I would be happy with something robust (it doesn't need to be a single regex).
Here's what I've tried so far but the problem is well above my regex skills:
import re

p = r'\d+(?:[,.]\d{3})*(?:[.,]\d*)'
for s in ['blabla 1,25 10.587.256,25 euros', '6.010,12', '6.010', '6,010', '6,010.12', '6010,124', '05.12.2018', '12.05.18']:
    print(s, re.findall(p, s, re.IGNORECASE))
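For reference, the steps described above (skip dates, strip grouping separators, normalise the decimal mark) could be sketched in more than one pass, roughly as follows. This is only a sketch under assumptions that may not hold for real data: the names `TOKEN`, `DATE`, `to_float` and `extract_numbers` are made up for illustration, dates are assumed to use dots only, and a single separator preceded by 1 to 3 digits and followed by exactly three digits is assumed to be a thousands separator (so '6.010' is read as 6010, never as 6.01).

```python
import re

# Candidate tokens: runs of digits joined by '.' or ','.
TOKEN = re.compile(r'\d+(?:[.,]\d+)*')
# Assumed date shapes: dd.mm.yyyy or mm.dd.yy, dots only.
DATE = re.compile(r'\d{1,2}\.\d{1,2}\.(?:\d{4}|\d{2})')


def to_float(token):
    """Return token as a float, or None if it looks like a date or is malformed."""
    if DATE.fullmatch(token):
        return None
    seps = [c for c in token if c in '.,']
    if not seps:
        return float(token)
    last = seps[-1]
    if len(set(seps)) == 2:
        # Both characters occur: assume the last one is the decimal mark.
        group, decimal = (',' if last == '.' else '.'), last
    elif len(seps) > 1 or re.fullmatch(r'[1-9]\d{0,2}[.,]\d{3}', token):
        # Repeated separator, or one separator followed by exactly 3 digits:
        # assume it is a grouping (thousands) separator.
        group, decimal = last, None
    else:
        # A single separator in any other position: assume decimal mark.
        group, decimal = None, last
    # Validate the overall shape before converting.
    int_part = rf'[1-9]\d{{0,2}}(?:{re.escape(group)}\d{{3}})+' if group else r'\d+'
    frac_part = rf'{re.escape(decimal)}\d+' if decimal else ''
    if not re.fullmatch(int_part + frac_part, token):
        return None
    if group:
        token = token.replace(group, '')
    if decimal:
        token = token.replace(decimal, '.')
    return float(token)


def extract_numbers(text):
    """Return every token in text that converts to a float, in order."""
    values = (to_float(tok) for tok in TOKEN.findall(text))
    return [v for v in values if v is not None]
```

With the sample strings above, `extract_numbers('6.010,12')` and `extract_numbers('6,010.12')` both give `[6010.12]`, while the two date strings give an empty list. The three-digit rule is the weak point: a token like '1,234' is inherently ambiguous, and this sketch always treats it as 1234.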