Regex Price Matching

Question

I have a web scraper that scrapes prices. For that, I need it to find prices like the following in strings:

762,50
1.843,75

In my first naive implementation, I didn't take the . into consideration and matched the first number with this regex perfectly:

re.findall("\d+,\d+", string)[0]

Now I need to match both cases and my initial idea was this:

re.findall("(\d+.\d+,\d+|\d+,\d+)", string)[0]

The idea was that the or operator (|) would match either the first form or the second, but it doesn't work. Any suggestions?

Not in Denmark :) It's equivalent to one thousand eight hundred forty-three Danish kroner and seventy-five øre (equivalent to cents) – mrhn, Feb 25 '14 at 15:09
If you're trying to parse HTML with regex, see http://stackoverflow.com/questions/1732348 :) – isedev, Feb 25 '14 at 15:13

falsetru · Answer 1 · 2014-02-25T15:33:38.287

2

In a regular expression, a dot (.) matches any character (except a newline, unless the DOTALL flag is set). Escape it to match . literally:

\d+\.\d+,\d+|\d+,\d+
   ^^
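To see the difference, here is a minimal sketch; the sample string is made up for illustration. With the unescaped dot, the space between two unrelated numbers is accepted as the "separator":

```python
import re

text = "item 12 762,50"  # hypothetical input, not from the question

# Unescaped dot: "." also matches the space, so "12" and "762,50" are glued together.
unescaped = re.findall(r"\d+.\d+,\d+|\d+,\d+", text)

# Escaped dot: only a literal "." is accepted between the digit groups.
escaped = re.findall(r"\d+\.\d+,\d+|\d+,\d+", text)

print(unescaped)  # ['12 762,50']
print(escaped)    # ['762,50']
```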

To match any number of leading dot-separated digit groups, the regular expression should be:

>>> re.findall(r'(?:\d+\.)*\d+,\d+', '1,23 1.843,75   123.456.762,50')
['1,23', '1.843,75', '123.456.762,50']

NOTE: I used a non-capturing group because re.findall returns a list of groups if one or more groups are present in the pattern.
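A small comparison, as a sketch, of the same pattern with a capturing and a non-capturing group. With the capturing group, re.findall reports only what the group captured (an empty string when the group did not take part in the match):

```python
import re

text = "1,23 1.843,75"

# Capturing group: findall returns the group's content, not the whole match.
capturing = re.findall(r"(\d+\.)*\d+,\d+", text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"(?:\d+\.)*\d+,\d+", text)

print(capturing)      # ['', '1.']
print(non_capturing)  # ['1,23', '1.843,75']
```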

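UPDATE: a stricter version that requires one to three leading digits followed by groups of exactly three digits, so it rejects input such as 1.2.3.4.5.6.789,123: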

>>> re.findall(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+',
...            '1,23 1.843,75   123.456.762,50  1.2.3.4.5.6.789,123')
['1,23', '1.843,75', '123.456.762,50']
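For readability, the same pattern can be written with re.VERBOSE and one comment per part. This is only a sketch; the name PRICE and the sample string are made up for illustration. Note that a number without thousands separators, such as 1234,50, is not matched by this pattern:

```python
import re

# Same pattern as above, spelled out piece by piece.
PRICE = re.compile(r"""
    (?<![\d.])      # not preceded by a digit or a dot
    \d{1,3}         # leading group of one to three digits
    (?:\.\d{3})*    # zero or more ".ddd" thousands groups
    ,\d+            # decimal comma followed by at least one digit
""", re.VERBOSE)

found = PRICE.findall("1,23 1.843,75 1.2.3,45 1234,50")
print(found)  # ['1,23', '1.843,75']
```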

edited Feb 25 '14 at 15:33

answered Feb 25 '14 at 15:08

falsetru


@C.B., Thank you for comment. BTW, `\.?` will not work for `1,75`. – falsetru Feb 25 '14 at 15:13
@C.B., I updated the answer according to your comment. – falsetru Feb 25 '14 at 15:16
This will match `1.2.3.4.5.6.789,123` – Toto Feb 25 '14 at 15:20

evuez · Accepted Answer · 2014-02-25T15:23:51.903

2

No need to use an or; just make the first part an optional group:

(?:\d+\.)?\d+,\d+

The ? after (?:\d+\.) makes that group optional. The ?: tells the engine not to capture the group, only to match it.

>>> re.findall(r'(?:\d+\.)?\d+,\d+', '1.843,75 762,50')
['1.843,75', '762,50']

Also note that you have to escape the . (dot); unescaped, it matches any character except a newline (see http://docs.python.org/2/library/re.html#regular-expression-syntax).
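One caveat worth checking, shown here as a sketch with a made-up second input: because the group is optional (?) rather than repeated (*), it covers only one thousands group. With two groups, the match starts too late and the leading digits are dropped:

```python
import re

pattern = r"(?:\d+\.)?\d+,\d+"

one_group = re.findall(pattern, "1.843,75")
two_groups = re.findall(pattern, "1.234.567,89")  # hypothetical larger price

print(one_group)   # ['1.843,75']
print(two_groups)  # ['234.567,89']
```

If prices of a million or more can occur, replacing the ? with * handles any number of groups.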

edited Feb 25 '14 at 15:23

answered Feb 25 '14 at 15:13

evuez


You should note that if the pattern contains a group, `re.findall` returns only the part captured by the group, not the whole match. For instance, `re.findall(r'(\d+\.)?\d+,\d+', '1.843,75')` returns `['1.']`. – falsetru Feb 25 '14 at 15:16
I found that adding parentheses around the whole pattern put the right result at index [0], like this: ((\d+\.)?\d+,\d+). I preferred that, but great solution, thanks! – mrhn Feb 25 '14 at 15:22
@mrhn, Using non-capturing group `(?:...)`, you don't need to surround the entire pattern with parentheses. – falsetru Feb 25 '14 at 15:24
I updated my answer with a non-capturing group for the first part; this way you get just the list of prices, not a tuple of the two groups, as mentioned by @falsetru – evuez Feb 25 '14 at 15:25

Corley Brigman · Answer 3 · 2014-02-25T15:24:30.190

In general, you have zero or more groups of the form XXX. followed by one group of the form XXX, (each of up to three digits), followed by exactly two digits. Do you also want to support numbers like 1,375 (without 'cents')? You also need to avoid some false detections.

That looks like this:

import re

# The dot in (?:\d{3}\.) is escaped so it matches only a literal '.'
matcher = r'((?:(?:(?:\d{1,3}\.)?(?:\d{3}\.)*\d{3},)|(?:(?<![.0-9])\d{1,3},))\d\d)'

re.findall(matcher, '1.843,75     762,50')

This detects a lot of boundary cases, but may not catch everything....

score 0 · Answer 4 · answered Feb 25 '14 at 15:18

How about:

(\d+[,.]\d+(?:[.,]\d+)?)

Matches:

- some digits followed by , or . and some digits

OR

- some digits followed by , or . and some digits followed by , or . and some digits

It matches: 762,50 and 1.843,75 and 1,75

It will also match 1.843.75. Are you OK with that?
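A quick check of those claims, as a sketch. Since the pattern has exactly one capturing group that wraps the whole expression, re.findall returns the full matches:

```python
import re

pattern = r"(\d+[,.]\d+(?:[.,]\d+)?)"

matches = re.findall(pattern, "762,50 1.843,75 1,75 1.843.75")
print(matches)  # ['762,50', '1.843,75', '1,75', '1.843.75']
```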

score 0 · Answer 5 · answered Feb 25 '14 at 15:18

I'd use this:

\d{1,3}(?:\.\d{3})*,\d\d

This will match numbers that use a dot as the thousands separator.
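Once the strings are matched, they usually need to become numbers. A minimal sketch, assuming the dot is always a thousands separator and the comma always the decimal mark; the names PRICE and to_decimal are made up for illustration:

```python
import re
from decimal import Decimal

PRICE = re.compile(r"\d{1,3}(?:\.\d{3})*,\d\d")

def to_decimal(price):
    """Convert a string like '1.843,75' to Decimal('1843.75')."""
    # Drop the thousands dots, then turn the decimal comma into a dot.
    return Decimal(price.replace(".", "").replace(",", "."))

prices = [to_decimal(p) for p in PRICE.findall("762,50 and 1.843,75")]
print(prices)  # [Decimal('762.50'), Decimal('1843.75')]
```

Decimal is used here instead of float so that amounts keep their exact two-digit fraction.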

answered Feb 25 '14 at 15:18

Toto


score 0 · Answer 6 · answered Feb 25 '14 at 15:19

\d*\.?\d{3},\d{2}
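A short check of what this pattern accepts, as a sketch with made-up extra inputs. It requires exactly three digits directly before the comma, so prices such as 1,75 or 62,50 are skipped:

```python
import re

pattern = r"\d*\.?\d{3},\d{2}"

matches = re.findall(pattern, "762,50 1.843,75 1,75 62,50")
print(matches)  # ['762,50', '1.843,75']
```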

answered Feb 25 '14 at 15:19

spoorcc


score 0 · Answer 7 · answered Feb 25 '14 at 15:22

This might be slower than regex, but given that the strings you are parsing are probably short, it should not matter.

Since the solution below does not use regex, it is simpler, and you can be more sure you are finding valid floats. Moreover, it parses the digit-strings into Python floats which is probably the next step you intend to perform anyway.

import locale
locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')

def float_filter(iterable):
    result = []
    for item in iterable:
        try:
            result.append(locale.atof(item))
        except ValueError:
            pass
    return result

text = 'The price is 762,50 kroner'

print(float_filter(text.split()))

yields

[762.5]

The basic idea: by setting a Danish locale, locale.atof parses commas as the decimal marker and dots as the grouping separator.

In [107]: import locale

In [108]: locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
Out[108]: 'en_DK.UTF-8'

In [109]: locale.atof('762,50')
Out[109]: 762.5

In [110]: locale.atof('1.843,75')
Out[110]: 1843.75
