1

Goal: Given a number (it may be very long, and it is greater than 0), I'd like to get its five least significant digits after dropping any trailing zeros.

I tried to solve this with a regex. With help from RegexBuddy I came up with this one:

[\d]+([\d]{0,4}+[1-9])0*

But Python can't compile that.

>>> import re
>>> re.compile(r"[\d]+([\d]{0,4}+[1-9])0*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.5/re.py", line 241, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

The problem is the "+" after "{0,4}"; it seems it doesn't work in Python (even in 2.6).

How can I write a working regex?

PS: I know you can start dividing by 10 and then using the remainder n%100000... but this is a problem about regex.
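
For reference, the arithmetic approach mentioned in the PS could look roughly like the sketch below. The function name last_five_arithmetic is made up for illustration. One difference from a regex on the string form: working on an integer drops any leading zeros inside the five-digit window.

```python
def last_five_arithmetic(n):
    """Five least significant digits of n after stripping trailing zeros.

    Hypothetical helper, not part of the original post. Assumes n is an int > 0.
    """
    # Drop trailing zeros one at a time.
    while n % 10 == 0:
        n //= 10
    # Keep at most the last five digits.
    return n % 100000


print(last_five_arithmetic(1234567000))
```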

Andrea Ambu
  • I don't think the + should be there at all, actually. – Michael Myers Jun 15 '09 at 14:56
  • Try to do a replace with \1. Test it on RegexBuddy with a long enough number and you'll see the difference – Andrea Ambu Jun 15 '09 at 15:01
  • Ah, Blixt's answer mentions that the + is supposed to be a modifier to force the {0,4} to be greedy. I don't remember ever seeing that bit of syntax before--and apparently neither does Python. (In Java, it apparently makes the {0,4} "possessive" instead of greedy.) – Michael Myers Jun 15 '09 at 15:12
  • Ah that's true, it forces it to be more than greedy really. The + tells the engine to never back-track, as it would by default. So it's not really greedy. Possessive is the correct definition =) – Blixt Jun 15 '09 at 15:59

5 Answers

10

That regular expression is more complicated than it needs to be. Try this:

>>> import re
>>> re.compile(r"(\d{0,4}[1-9])0*$")

The above regular expression assumes that the number is valid (it will also match "abc0123450", for example.) If you really need the validation that there are no non-number characters, you may use this:

>>> import re
>>> re.compile(r"^\d*?(\d{0,4}[1-9])0*$")

Anyway, the \d does not need to be in a character class, and the quantifier {0,4} does not need the additional + (which is meant to force it to be greedy, although apparently Python does not recognize that syntax).

Also, in the second regular expression, the leading \d*? is non-greedy, as I believe this will improve the performance and accuracy: a greedy \d* would swallow the digits that the group is meant to capture. I also made it "zero or more" as I assume that is what you want.

I also added anchors as this ensures that your regular expression won't match anything in the middle of a string. If this is what you desired though (maybe you're scanning a long text?), remove the anchors.
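
As a quick check, the two patterns above can be wrapped in a small helper. This is only a sketch: the names LOOSE, STRICT and last_five are invented for illustration and are not part of the answer.

```python
import re

# The two patterns from this answer; the constant names are hypothetical.
LOOSE = re.compile(r"(\d{0,4}[1-9])0*$")
STRICT = re.compile(r"^\d*?(\d{0,4}[1-9])0*$")


def last_five(text, pattern=STRICT):
    """Return the captured digits, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None


print(last_five("1234567000"))         # digits before the trailing zeros
print(last_five("abc0123450"))         # STRICT rejects non-digit input
print(last_five("abc0123450", LOOSE))  # LOOSE only looks at the end of the string
```

Note that both patterns keep zeros inside the captured window, so the result is a string rather than an integer.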

Blixt
5

\d{0,4}+ is a possessive quantifier supported by certain regular expression flavors such as .NET and Java. Python does not support possessive quantifiers.
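
To illustrate what "possessive" means, here is a minimal sketch using a common emulation: a lookahead that captures the digits, followed by a backreference. It relies on lookaheads not being re-entered once they have matched, and it runs on Python versions without possessive quantifiers (the re module only gained them in Python 3.11). The variable names are made up for illustration.

```python
import re

# Ordinary greedy quantifier: it may give a digit back so [1-9] can match.
backtracking = re.compile(r"\d{0,4}[1-9]")

# Emulated possessive quantifier: the lookahead grabs up to four digits,
# \1 consumes exactly those digits, and nothing is ever given back.
no_backtracking = re.compile(r"(?=(\d{0,4}))\1[1-9]")

print(backtracking.match("1234"))     # matches by giving back the final digit
print(no_backtracking.match("1234"))  # no match: all four digits stay consumed
```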

In RegexBuddy, select Python in the toolbar at the top, and RegexBuddy will tell you that Python doesn't support possessive quantifiers. The + will be highlighted in red in the regular expression, and the Create tab will indicate the error.

If you select Python on the Use tab in RegexBuddy, RegexBuddy will generate a Python source code snippet with a regular expression without the possessive quantifier, and a comment indicating that the removal of the possessive quantifier may yield different results. Here's the Python code that RegexBuddy generates using the regex from the question:

# Your regular expression could not be converted to the flavor required by this language:
# Python does not support possessive quantifiers

# Because of this, the code snippet below will not work as you intended, if at all.

reobj = re.compile(r"[\d]+([\d]{0,4}[1-9])0*")
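
To see why that warning matters, the sketch below compares the generated pattern with a version that emulates the possessive quantifier through a lookahead plus backreference. The emulation and the variable names are assumptions added for illustration, not RegexBuddy output.

```python
import re

number = "1234567000"

# The pattern generated above, with the possessive "+" simply removed.
stripped = re.compile(r"[\d]+([\d]{0,4}[1-9])0*")

# Hypothetical emulation of the possessive {0,4}+: the lookahead captures
# up to four digits and \2 consumes them without allowing backtracking.
emulated = re.compile(r"\d+((?=(\d{0,4}))\2[1-9])0*")

print(stripped.match(number).group(1))  # the greedy [\d]+ leaves only one digit
print(emulated.match(number).group(1))  # the five digits before the trailing zeros
```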

What you probably did is select a flavor such as Java in the main toolbar, and then click Copy Regex as Python String. That will give you a Java regular expression formatted as a Python string. The items in the Copy menu do not convert your regular expression. They merely format it as a string. This allows you to do things like format a JavaScript regular expression as a Python string so your server-side Python script can feed a regex into client-side JavaScript code.

Jan Goyvaerts
  • Oh well, there is a pretty old version at school, just downloaded the new one at home and there is the toolbar :D thanks! – Andrea Ambu Jun 17 '09 at 10:05
  • My response applies to RegexBuddy 3.0.0 and later. Version 3.0.0 was released on 13 June 2007. That's the first version that can emulate different regex flavors (currently 15). – Jan Goyvaerts Jun 17 '09 at 13:40
2

Small tip: I recommend you test with reTest instead of RegexBuddy. There are different regular expression engines for different programming languages. reTest is valuable in that it allows you to quickly test regular expression strings within Python itself. That way you can ensure that you tested your syntax with Python's regular expression engine.
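
Even without a dedicated tool, a pattern can be checked against Python's own engine in a few lines. This is a minimal sketch; the helper name compiles is hypothetical and has nothing to do with reTest itself.

```python
import re


def compiles(pattern):
    """Report whether this interpreter's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error as exc:
        # The message is the same kind of error shown in the question.
        print("rejected %r: %s" % (pattern, exc))
        return False
    return True


print(compiles(r"(\d{0,4}[1-9])0*$"))
print(compiles(r"a**"))
```

Whether a given pattern is accepted can depend on the Python version, so it is worth running this with the interpreter used in production.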

Jay Atkinson
  • Ultimately, any regular expression you use has to be tested in your actual application, on your actual data. Running initial tests in a tool like RegexBuddy while your regex is under construction saves you time, as long as the tool is used properly (in this case, select Python in RegexBuddy's toolbar when using Python). – Jan Goyvaerts Jun 16 '09 at 14:41
0

This is my solution.

re.search(r'[1-9]\d{0,3}[1-9](?=0*(?:\b|\s|[A-Za-z]))', '02324560001230045980a').group(0)

'4598'

  • [1-9] - the match must start with a digit from 1 to 9
  • \d{0,3} - 0 to 3 digits
  • [1-9] - the match must end with a digit from 1 to 9
  • (?=0*(?:\b|\s|[A-Za-z])) - what follows must be zero or more 0s and then a word boundary (\b), whitespace (\s) or a letter ([A-Za-z])
memowe
Badarau Petru
0

The error seems to be that you have two quantifiers in a row, {0,4} and +. Unless + is meant to be a literal here (which I doubt, since you're talking about numbers), then I don't think you need it at all. Unless it means something different in this situation (possibly the greediness of the {} quantifier)? I would try

[\d]+([\d]{0,4}[1-9])0*

If you actually intended to have both quantifiers to be applied, then this might work

[\d]+(([\d]{0,4})+[1-9])0*

But given your specification of the problem, I doubt that's what you want.

Sean