Regular expression to match a Python integer literal

Question

What is a regular expression that will match any valid Python integer literal in a string? It should support all the extra stuff like o and l, but not match a float, or a variable with a number in it. I am using Python's re, so any syntax supported by that is OK.

EDIT: Here's my motivation (as apparently that's quite important). I am trying to fix http://code.google.com/p/sympy/issues/detail?id=3182. What I want to do is create a hook for IPython that automatically converts int/int (like 1/2) to Rational(int, int) (like Rational(1, 2)). The reason is that otherwise it is impossible to make 1/2 register as a rational number, because it's Python type __div__ Python type. In SymPy, this can be quite annoying because things like x**(1/2) will create x**0 (or x**0.5 with __future__ division or Python 3), when what you want is x**Rational(1, 2), an exact quantity.

My solution is to add a hook to IPython that automatically wraps all integer literals in the input with Integer (SymPy's custom integer class that gives Rational on division). This will let me add an option to isympy that will let SymPy act more like a traditional computer algebra system in this respect, for those who want it. I hope this explains why I need it to match any and all literals inside an arbitrary Python expression, which is why it needs to not match float literals and variables with numbers in their names.

Also, since everyone's so interested in what I tried, here it is: not much before I gave up (regular expressions are hard). I played with (?!\.) to make it not catch the first part of float literals, but this didn't seem to work. I'd be curious if someone can tell me why; an example is re.sub(r"(\d*(?!\.))", r"S$\1$", "12.1").
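On why the lookahead attempt fails: `\d*` is allowed to backtrack, so when `(?!\.)` rejects the position right before the dot, the engine gives back one digit and the lookahead then succeeds in front of the `2`. A minimal sketch of that behaviour and of one possible guard follows. The extra lookbehind and the `[\d.]` classes are an illustration, not part of the original attempt, and the name `guarded` is made up for the example.

```python
import re

# The greedy \d* first grabs "12", the lookahead sees "." and fails,
# so the engine backtracks to "1", where the next character is "2".
m = re.match(r"\d*(?!\.)", "12.1")
print(m.group())  # 1

# Forbidding a digit OR a dot after the match blocks that backtracking step,
# and a lookbehind keeps the fractional part from matching on its own.
guarded = re.compile(r"(?<![\d.])\d+(?![\d.])")
print(guarded.findall("12.1 + 7"))  # ['7']
```

This guard only handles plain decimal digits; it says nothing yet about prefixes, suffixes, or identifiers such as x1.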

EDIT 2: Since I plan to use this in conjunction with re.sub, you might as well wrap the whole thing in parentheses in your answers so I can use \1 :)

Everything you need to know is in the [Python Docs](http://python.org/doc/) — Joel Cornett, Jul 31 '12 at 05:10
I did do my research. I googled for it, and even tried it myself. It got me nowhere. I didn't include that in the question because I didn't feel it was relevant. — asmeurer, Jul 31 '12 at 05:10
And considering that none of the answers so far do what I want, I'd say it's not a trivial problem. — asmeurer, Jul 31 '12 at 05:11
@asmeurer it's usually better to post your wrong/incomplete solution (in the question) than nothing, purely for this reason. Also, mentioning why you want to do something along with the rest of the question can be handy, because there may be other solutions you didn't expect that are better than the one asked for. — Josh Smeaton, Jul 31 '12 at 05:21
I agree with @JoshSmeaton. Sorry if I was a little rude. If you edit your question, I can reverse my downvote. — Joel Cornett, Jul 31 '12 at 05:28

Danica · Answer 1 · 2012-07-31T05:02:00.050

The definition of the integer literal is (in 3.x, slightly different in 2.x):

integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"+
nonzerodigit   ::=  "1"..."9"
digit          ::=  "0"..."9"
octinteger     ::=  "0" ("o" | "O") octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
octdigit       ::=  "0"..."7"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"
bindigit       ::=  "0" | "1"

So, something like this:

[1-9]\d*|0+|0[oO][0-7]+|0[xX][\da-fA-F]+|0[bB][01]+
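A quick way to sanity-check a pattern built from the 3.x grammar is to anchor it with re.fullmatch and try a few samples. The sketch below is only that: `INT3` and `is_py3_int_literal` are hypothetical names, `[0-9]` is used instead of `\d` because `\d` also accepts non-ASCII digits in Python 3 str patterns, and the underscore separators allowed since Python 3.6 (such as 1_000) are deliberately not covered.

```python
import re

# Prefixed forms come first; "0+" follows the grammar's '"0"+' rule.
INT3 = re.compile(r"0[oO][0-7]+|0[xX][0-9a-fA-F]+|0[bB][01]+|[1-9][0-9]*|0+")

def is_py3_int_literal(text):
    """Return True if the whole string is one Python 3 integer literal."""
    return INT3.fullmatch(text) is not None

for sample in ["0", "00", "42", "0o17", "0xFF", "0b101", "012", "1.5", "x1"]:
    print(sample, is_py3_int_literal(sample))
```

The last three samples are rejected: 012 is not a valid 3.x literal, 1.5 is a float, and x1 is an identifier.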

Based on saying you want to support "l", I guess you actually want the 2.x definition:

longinteger    ::=  integer ("l" | "L")
integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"
octinteger     ::=  "0" ("o" | "O") octdigit+ | "0" octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
nonzerodigit   ::=  "1"..."9"
octdigit       ::=  "0"..."7"
bindigit       ::=  "0" | "1"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"

which can be written

(?:[1-9]\d*|0|0[oO]?[0-7]+|0[xX][\da-fA-F]+|0[bB][01]+)[lL]?

This still matches the first part of float literals and the number part of variables that contain numbers. — asmeurer, Jul 31 '12 at 05:05
I haven't written it yet, but it looks like the decimal example from the Python docs is almost exactly what I want. — asmeurer, Aug 03 '12 at 22:17

ruakh · Answer 2 · 2012-07-31T12:18:17.133

4

The syntax is described at http://docs.python.org/reference/lexical_analysis.html#integers. Here's one way to express it as a regex:

(0|[1-9][0-9]*|0[oO]?[0-7]+|0[xX][0-9a-fA-F]+|0[bB][01]+)[lL]?
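One caveat when a pattern like this is used unanchored (for example with re.sub): Python's re tries alternatives left to right and keeps the first one that succeeds, so a bare `0` listed before the prefixed forms wins on inputs such as 0xFF. The sketch below shows the effect next to a reordered variant. The reordering is a suggestion of this note rather than part of the answer, and the names `original` and `reordered` are made up for the example.

```python
import re

original = re.compile(
    r"(0|[1-9][0-9]*|0[oO]?[0-7]+|0[xX][0-9a-fA-F]+|0[bB][01]+)[lL]?"
)
# Same alternatives, longest-prefix forms first and the bare zero last.
reordered = re.compile(
    r"(0[xX][0-9a-fA-F]+|0[bB][01]+|0[oO]?[0-7]+|[1-9][0-9]*|0)[lL]?"
)

for text in ["0xFF", "0744", "0b101", "10L", "0"]:
    print(text, original.match(text).group(), reordered.match(text).group())
```

With the original order the first three samples all match just "0"; the reordered variant returns the whole literal in each case.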

Disclaimer: this does not support negative integers, because in Python, the - in something like -31 isn't actually part of the integer literal, but rather, it's a separate operator.

edited Jul 31 '12 at 12:18

answered Jul 31 '12 at 04:53

ruakh


Missing the format for e.g. `0755` as an octal literal; also requires the `[lL]` on the end right now. – Danica Jul 31 '12 at 04:59
It's OK if the `-` is separate. It will still work out fine for what I am doing. – asmeurer Jul 31 '12 at 05:11
Hmmm interesting point about the `-`. Now that I think about it, it makes sense that it would be a separate operator. – Joel Cornett Jul 31 '12 at 05:37
@Dougal: In other words, I was missing both instances of `?`. Dunno how that happened. Thanks for pointing it out; fixed now. – ruakh Jul 31 '12 at 12:19

score 4 · Accepted Answer · answered Jul 31 '12 at 07:58

I'm not convinced using an re is the way to go. Python has tokenize, ast, symbol and parser modules that can be used to parse/process/manipulate/re-write Python code...

>>> s = "33.2 + 6 * 0xFF - 0744"
>>> from StringIO import StringIO
>>> import tokenize
>>> t = list(tokenize.generate_tokens(StringIO(s).readline))
>>> t
[(2, '33.2', (1, 0), (1, 4), '33.2 + 6 * 0xFF - 0744'), (51, '+', (1, 5), (1, 6), '33.2 + 6 * 0xFF - 0744'), (2, '6', (1, 7), (1, 8), '33.2 + 6 * 0xFF - 0744'), (51, '*', (1, 9), (1, 10), '33.2 + 6 * 0xFF - 0744'), (2, '0xFF', (1, 11), (1, 15), '33.2 + 6 * 0xFF - 0744'), (51, '-', (1, 16), (1, 17), '33.2 + 6 * 0xFF - 0744'), (2, '0744', (1, 18), (1, 22), '33.2 + 6 * 0xFF - 0744'), (0, '', (2, 0), (2, 0), '')]
>>> nums = [eval(i[1]) for i in t if i[0] == tokenize.NUMBER]
>>> nums
[33.2, 6, 255, 484]
>>> print map(type, nums)
[<type 'float'>, <type 'int'>, <type 'int'>, <type 'int'>]

There's an example at http://docs.python.org/library/tokenize.html that re-writes floats as decimal.Decimal
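The session above is Python 2. A rough Python 3 sketch of the same idea, in the spirit of that docs example, might look like the following. `wrap_int_literals`, `is_int_literal` and the default wrapper name `Integer` are hypothetical names chosen for illustration, and splicing by token column is just one option (tokenize.untokenize is another).

```python
import io
import tokenize

def is_int_literal(text):
    """Guess whether a NUMBER token is an integer (hypothetical helper)."""
    lowered = text.lower()
    if lowered.startswith(("0x", "0o", "0b")):
        return True  # prefixed literals are always integers
    # Floats contain "." or an exponent; imaginary literals end in "j".
    return not any(ch in lowered for ch in ".ej")

def wrap_int_literals(source, wrapper="Integer"):
    """Return source with each integer literal rewritten as wrapper(literal)."""
    lines = source.split("\n")
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Walk backwards so earlier column offsets stay valid while splicing.
    for tok in reversed(tokens):
        if tok.type == tokenize.NUMBER and is_int_literal(tok.string):
            row, start = tok.start
            _, end = tok.end
            line = lines[row - 1]
            lines[row - 1] = f"{line[:start]}{wrapper}({tok.string}){line[end:]}"
    return "\n".join(lines)

print(wrap_int_literals("x1 + 1/2 + 3.5 + 0xFF + '7'"))
# x1 + Integer(1)/Integer(2) + 3.5 + Integer(0xFF) + '7'
```

Because the tokenizer already knows what is a name, a float, or a string, the 7 inside the string literal and the 1 in x1 are left alone.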

That is a good point. I wonder if there is a significant speed difference in doing it this way. — asmeurer, Jul 31 '12 at 09:41
@asmeurer Thanks for accepted answer - how did it work out? (any link to see update?) — Jon Clements, Aug 03 '12 at 18:13
see https://github.com/sympy/sympy/pull/1470. Ironically, the hard part was getting IPython to do this automatically. It turns out their API needs updating. — asmeurer, Aug 07 '12 at 22:23

Tim Pietzcker · Answer 4 · 2012-07-31T07:00:22.173

2

If you really want to match both "dialects", you'll get some ambiguities, for example with octals (the o is required in Python 3). But the following should work:

r = r"""(?xi) # Verbose, case-insensitive regex
(?<!\.)       # Assert no dot before the number
\b            # Start of number
(?:           # Match one of the following:
 0x[0-9a-f]+| # Hexadecimal number
 0o?[0-7]+|   # Octal number
 0b[01]+|     # Binary number
 0+|          # Zero
 [1-9]\d*     # Other decimal number
)             # End of alternation
L?            # Optional Long integer
\b            # End of number
(?!\.)        # Assert no dot after the number"""
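As a sketch of how a pattern like the one above could be plugged into re.sub: `\g<0>` stands for the whole match, so no extra capturing group is needed. The names `INT_LITERAL` and `wrap` and the replacement text `Integer(...)` are assumptions made for the example.

```python
import re

INT_LITERAL = re.compile(r"""
(?<!\.)       # no dot before the number
\b            # start of number
(?:
 0x[0-9a-f]+| # hexadecimal
 0o?[0-7]+|   # octal
 0b[01]+|     # binary
 0+|          # zero
 [1-9]\d*     # other decimal number
)
L?            # optional long suffix
\b            # end of number
(?!\.)        # no dot after the number
""", re.VERBOSE | re.IGNORECASE)

def wrap(source):
    # \g<0> refers to the entire match.
    return INT_LITERAL.sub(r"Integer(\g<0>)", source)

print(wrap("x1 + 12 * 3.5 - 0xFF"))  # x1 + Integer(12) * 3.5 - Integer(0xFF)
print(wrap("'7'"))                   # 'Integer(7)'
```

The second call shows a limitation of any purely textual approach: digits inside string literals are rewritten too. The exponent digits of a float written like 1e-5 are matched as well.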

edited Jul 31 '12 at 07:00

answered Jul 31 '12 at 06:19

Tim Pietzcker


Yes, I know that I'll have to use different ones for different Pythons, but that's not a big deal as I care only about the running Python version, so a simple sys.version_info will do it for me. – asmeurer Jul 31 '12 at 06:22
Shouldn't it be a raw string? – asmeurer Jul 31 '12 at 06:34
Also, unless I parenthesized it incorrectly for `\1`, it doesn't seem to work correctly for floats (it just matches both ints before and after the `.`) – asmeurer Jul 31 '12 at 06:36
You're right. I had misconstructed the lookaround assertions (it's too early in the morning). Now it should finally work. Sorry. – Tim Pietzcker Jul 31 '12 at 06:37
Also, you don't need any parentheses - `\0` contains the entire match. – Tim Pietzcker Jul 31 '12 at 06:38
`\0` doesn't seem to work, but my parenthesization does. I'll have to do some more rigorous testing, but I think this is my answer. – asmeurer Jul 31 '12 at 06:40
Just noticed a problem. Apparently octal and hexadecimal longs are allowed, but are not supported here. So is `0l`. I think there should be a `L?` at the end of pretty much each line. – asmeurer Jul 31 '12 at 06:43
I think you should just drop the sign part. More accurate would be something like `[+-]*`, because Python allows things like `+--+-+1`, but, as I said, I don't need it (and, at least how I parenthesized it, it doesn't seem to be included in the match anyway). – asmeurer Jul 31 '12 at 06:48
Well, the question asked for a regular expression, and this seems to be the best one in that respect, so I'm marking it as the answer, but I think I'll actually go with the tokenize method from @JonClements answer. Why? The regular expression method will still replace integers in string literals, and any regular expression solution would. – asmeurer Aug 03 '12 at 09:17
@asmeurer: Actually, I think that if the best answer to the question "which regex should I use?" is "don't use a regex, do *this* instead", then you should accept that answer. Choose the answer that helped you best (and that following visitors will learn from the most). – Tim Pietzcker Aug 03 '12 at 12:49

Joel Cornett · Answer 5 · 2012-07-31T07:23:32.743

1

Would something like this suffice?

r = r"""
(?<![\w.])               #Start of string or non-alpha non-decimal point
(?:                      #Group the alternatives so both lookarounds apply to each
    0[X][0-9A-F]+L?|     #Hexadecimal
    0[O][0-7]+L?|        #Octal
    0[B][01]+L?|         #Binary
    [1-9]\d*L?           #Decimal/Long Decimal, will not match 0____
)
(?![\w.])                #End of string or non-alpha non-decimal point
"""

(with flag re.VERBOSE | re.IGNORECASE)
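One thing to watch with this kind of layout: `|` has the lowest precedence in a regex, so a lookbehind written before the first alternative and a lookahead written after the last one guard only those two branches unless the alternation is wrapped in a group. The reduced sketch below (two branches only, names `ungrouped` and `grouped` invented for the example) shows the difference.

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE

# Without a group, "|" splits the whole pattern: the lookbehind guards only
# the first branch and the lookahead guards only the last one.
ungrouped = re.compile(r"""
(?<![\w.])
    0x[0-9a-f]+L?|
    [1-9]\d*L?
(?![\w.])
""", FLAGS)

# With a non-capturing group, both lookarounds apply to every branch.
grouped = re.compile(r"""
(?<![\w.])
(?:
    0x[0-9a-f]+L?|
    [1-9]\d*L?
)
(?![\w.])
""", FLAGS)

print(ungrouped.findall("x12"))            # ['12']
print(grouped.findall("x12"))              # []
print(grouped.findall("12 + 0xFF + 3.5"))  # ['12', '0xFF']
```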

edited Jul 31 '12 at 07:23

answered Jul 31 '12 at 05:27

Joel Cornett


Instead of `(?:^|[^\w\.])`, you should use `(?<![\w.])`. Same with `(?:$|[^\w\.])`: use `(?![\w.])`. Otherwise the characters before/after the number will become part of the match. – Tim Pietzcker Jul 31 '12 at 06:07
Also, octals only go up to the digit `7`. And you can make your regex more legible using the `re.I` flag. – Tim Pietzcker Jul 31 '12 at 06:11

score 0 · Answer 6 · answered Jul 31 '12 at 05:00

0

This gets fairly close:

re.match(r'^(0[xob])?\d+[Ll]?$', '0o123l')

answered Jul 31 '12 at 05:00

Josh Smeaton


ugh, after looking at some of the answers, mine will provide a lot of false positives, and completely skips hex literals. – Josh Smeaton Jul 31 '12 at 05:02
Wow a downvote for an incomplete answer, even after I mention the limitations? Figure lack of an upvote should be enough. – Josh Smeaton Jul 31 '12 at 05:06
In my experience, you gotta just delete your wrong answers, or they will be downvoted into oblivion (though honestly at 10.3k I wouldn't be worrying too much about my reputation if I were you) – asmeurer Jul 31 '12 at 05:09
@asmeurer yeah you're right - and I'm not worried too much about reputation as much as education I guess. – Josh Smeaton Jul 31 '12 at 05:19
