2

I have a web scraper that scrapes prices. For that, I need it to find prices like the following in strings:

  • 762,50
  • 1.843,75

In my first naive implementation, I didn't take the . into consideration and matched the first number with this regex perfectly:

re.findall("\d+,\d+", string)[0]

Now I need to match both cases and my initial idea was this:

re.findall("(\d+.\d+,\d+|\d+,\d+)", string)[0]

The idea was that the or operator (|) would match either the first or the second form, but this doesn't work. Any suggestions?

spoorcc
mrhn

7 Answers

2

In a regular expression, a dot (.) matches any character (except a newline, unless the DOTALL flag is set). Escape it to match . literally:

\d+\.\d+,\d+|\d+,\d+
   ^^
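As a small illustration of why the escape matters, here is a sketch comparing the two patterns on a made-up input string (the sample text is an assumption, not from the question). With the unescaped dot, the space between a quantity and the price is accepted as the "separator":

```python
import re

# Hypothetical input: a quantity followed by a price.
text = 'Qty 3 762,50'

# Unescaped dot: '.' matches the space, so the quantity is swallowed.
unescaped = re.findall(r'\d+.\d+,\d+|\d+,\d+', text)

# Escaped dot: only a literal '.' is accepted before the comma part.
escaped = re.findall(r'\d+\.\d+,\d+|\d+,\d+', text)

print(unescaped)  # ['3 762,50']
print(escaped)    # ['762,50']
```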

To match multiple leading digit groups, the regular expression should be:

>>> re.findall(r'(?:\d+\.)*\d+,\d+', '1,23 1.843,75   123.456.762,50')
['1,23', '1.843,75', '123.456.762,50']

NOTE: I used a non-capturing group because re.findall returns a list of groups if one or more groups are present in the pattern.
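To make the difference concrete, this sketch runs the same pattern once with a capturing group and once with a non-capturing group (the variable names are only for illustration):

```python
import re

text = '1,23 1.843,75'

# Capturing group: findall returns what the group captured, not the whole match.
capturing = re.findall(r'(\d+\.)*\d+,\d+', text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:\d+\.)*\d+,\d+', text)

print(capturing)      # ['', '1.']
print(non_capturing)  # ['1,23', '1.843,75']
```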

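UPDATE: to accept only well-formed thousands groups (1 to 3 leading digits, then dot-separated groups of exactly 3 digits):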

>>> re.findall(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+',
...            '1,23 1.843,75   123.456.762,50  1.2.3.4.5.6.789,123')
['1,23', '1.843,75', '123.456.762,50']
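If the matched strings then need to become numbers, one possible follow-up is sketched here. The helper name `find_prices`, the constant `PRICE_RE`, and the sample sentence are assumptions for illustration; `Decimal` is used rather than `float` only because it is a common choice for money, not because the question asks for it.

```python
import re
from decimal import Decimal

# Strict pattern: 1-3 leading digits, groups of exactly 3, then a comma part.
PRICE_RE = re.compile(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+')

def find_prices(text):
    """Hypothetical helper: return all prices in text as Decimal values."""
    prices = []
    for match in PRICE_RE.findall(text):
        # Drop the thousands separators, then use '.' as the decimal marker.
        normalized = match.replace('.', '').replace(',', '.')
        prices.append(Decimal(normalized))
    return prices

prices = find_prices('Now 762,50 kr, before 1.843,75 kr')
print(prices)
```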
falsetru
2

No need to use an or; just make the first part optional:

(?:\d+\.)?\d+,\d+

The ? after (?:\d+\.) makes that part optional. The '?:' means the group is not captured, only matched.

>>> re.findall(r'(?:\d+\.)?\d+,\d+', '1.843,75 762,50')
['1.843,75', '762,50']

Also note that you have to escape the . (dot); unescaped, it matches any character except a newline (see http://docs.python.org/2/library/re.html#regular-expression-syntax).

evuez
  • You should note that if the pattern contains a group, `re.findall` returns only what the group captured, not the whole match. For instance, `re.findall(r'(\d+\.)?\d+,\d+', '1.843,75')` returns `['1.']`. – falsetru Feb 25 '14 at 15:16
  • I found that adding parentheses around the whole pattern put the right result at index [0], like this: ((\d+\.)?\d+,\d+). That is what I preferred, but great solution, thanks! – mrhn Feb 25 '14 at 15:22
  • @mrhn, Using a non-capturing group `(?:...)`, you don't need to surround the entire pattern with parentheses. – falsetru Feb 25 '14 at 15:24
  • I updated my answer with a non-capturing group for the first part; this way you just get the list of prices, not a tuple of the two groups, as mentioned by @falsetru – evuez Feb 25 '14 at 15:25
0

In general, you have zero or more groups of the form XXX., followed by one group XXX, (each group up to 3 digits), always followed by two digits. Do you also want to support numbers like 1,375 (without 'cents')? You also need to avoid some false detections.

That looks like this:

import re

# The dot after \d{3} is escaped so that only a literal '.' separates the groups.
matcher = r'((?:(?:(?:\d{1,3}\.)?(?:\d{3}\.)*\d{3}\,)|(?:(?<![.0-9])\d{1,3},))\d\d)'

re.findall(matcher, '1.843,75     762,50')

This detects a lot of boundary cases, but may not catch everything....

Corley Brigman
0

How about:

(\d+[,.]\d+(?:[.,]\d+)?)

Matches:

- some digits followed by , or . and some digits

OR

- some digits followed by , or . and some digits followed by , or . and some digits

It matches: 762,50 and 1.843,75 and 1,75

It will also match 1.843.75; are you OK with that?
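A quick way to check these cases is to run the pattern over all four sample values at once; this is just a sketch, with the sample string assembled from the numbers mentioned above:

```python
import re

pattern = r'(\d+[,.]\d+(?:[.,]\d+)?)'

# The three intended matches plus the questionable 1.843.75.
samples = '762,50 1.843,75 1,75 1.843.75'

matches = re.findall(pattern, samples)
print(matches)  # ['762,50', '1.843,75', '1,75', '1.843.75']
```

Because the single group wraps the whole pattern, `re.findall` returns the full match for each value.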

See it in action.

e h
0

I'd use this:

\d{1,3}(?:\.\d{3})*,\d\d

This will match numbers that use a dot as the thousands separator.

Toto
0
\d*\.?\d{3},\d{2}

See the working example here

spoorcc
0

This might be slower than regex, but given that the strings you are parsing are probably short, it should not matter.

Since the solution below does not use regex, it is simpler, and you can be more sure you are finding valid floats. Moreover, it parses the digit strings into Python floats, which is probably the next step you intend to perform anyway.

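import locale

# Requires the en_DK.UTF-8 locale to be installed; setlocale raises locale.Error otherwise.
locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')

def float_filter(iterable):
    """Return the floats parsed from the items that are valid numbers in the current locale."""
    result = []
    for item in iterable:
        try:
            result.append(locale.atof(item))
        except ValueError:
            # Not a number (e.g. a word); skip it.
            pass
    return result

text = 'The price is 762,50 kroner'

print(float_filter(text.split()))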

yields

[762.5]

The basic idea: by setting a Danish locale, locale.atof parses commas as the decimal marker and dots as the grouping separator.

In [107]: import locale

In [108]: locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
Out[108]: 'en_DK.UTF-8'

In [109]: locale.atof('762,50')
Out[109]: 762.5

In [110]: locale.atof('1.843,75')
Out[110]: 1843.75
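One caveat: `locale.setlocale` changes the setting for the whole process. If that is a concern, a possible approach is to switch the numeric locale only while parsing and restore it afterwards. The sketch below assumes a hypothetical helper named `temporary_numeric_locale`; it also assumes the `en_DK.UTF-8` locale may not be installed, so that case is caught rather than taken for granted.

```python
import locale
from contextlib import contextmanager

@contextmanager
def temporary_numeric_locale(name):
    """Hypothetical helper: set LC_NUMERIC to `name`, then restore the old value."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        yield locale.setlocale(locale.LC_NUMERIC, name)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)

try:
    with temporary_numeric_locale('en_DK.UTF-8'):
        print(locale.atof('1.843,75'))
except locale.Error:
    # The locale is not available on this system.
    print('en_DK.UTF-8 locale is not installed')
```

Note that this still modifies process-wide state while the block runs, so it is not safe to use from several threads at once.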
unutbu