is there a regex to generate all integers for a certain programming language

Question

let's say I'm building a compiler and I want the lexical analyzer to recognize integers of the C language, can I specify for example that the integer should be between –2,147,483,648 and 2,147,483,647 that a long integer can be 64 bits? I feel that my question is stupid but I want to know if it's doable... thanks

The job of the lexer is only to segment and identify the parts of your source code, the validation of ranges is not its job, let this job to the parser. — Casimir et Hippolyte, Mar 09 '15 at 01:07

score 5 · Answer 1 · edited May 23 '17 at 12:20

Short answer

Yes that can be done, but you should not do that!

Spoiler alert: you should better be using strtol, and I'm telling you why in the long answer.

Long answer

it can be done using a weirdly crafted regexp (the worst one being a regexp with the list of all integers between MIN and MAX), but you do not want to do such a thing.

This is because such a task would mean a massive processing for the regexp, whereas that test can be done in a very little processing in your favorite language (consider the following as pseudocode):

if (str_to_int(s) > CMIN && str_to_int(s) < CMAX)

Well, actually you might tell me "but if it's an int, it will overflow!". But there are technics to detect that:

How to detect integer overflow?

and none of them is using a regex!

But anyway, you don't need to go into so much trouble, when there's a function already baked in the C standard library that does that job for you: the strtol function! Quoting the manual:

The strtol() function returns the result of the conversion, unless the value would underflow or overflow. If an underflow occurs, strtol() returns LONG_MIN. If an overflow occurs, strtol() returns LONG_MAX. In both cases, errno is set to ERANGE. Precisely the same holds for strtoll() (with LLONG_MIN and LLONG_MAX instead of LONG_MIN and LONG_MAX).

Why would it be massive? It's because a regexp is an automaton looking at a stream of characters. When there's a match, you move along the automaton. Basically, you'd need to:

match any string of 10 characters, or 11 only if it starts with a -
that contains only digits,
that if it starts with a 2, can only be followed by 0 or 1,
that if it starts with a 2, followed by a 1, can only be followed by 0, 1, 2, 3 or 4
that if it starts with a 2, followed by a 1 and then a 4, can only be followed by a 1, 2, 3, 4 … 7
…
if it starts with a 2, folowed by … and ends with a 7, but if it started with a -, and then a 2, it needs to end with a 6 (so basically you have to copy all the previous conditions into another subgraph that ends with that)
and for any other character it's a match.

That would look a bit like the following:

^(
  (
   \d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
   \d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
   [0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-8]
  )|
  -(
    \d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
    \d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
    [0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-7]
   )
 )$

which is represented visually by the following automaton (click on the image to play with it):

I'm not sure how correct that would be, because I might have missed edge cases, but I hope I made it clear how it compares with doing it in your favorite language. If you actually parse such a huge automaton, it will:

burn CPU time,
burning electricity,
burning (fuel|coal|gaz|uranium),
polluting the planet,
killing a baby seal

all that instead of doing something that can be done in an operation being 1/100th of the complexity of doing the same thing using a regexp.

cute baby seal

So if you don't want to kill a baby seal because of bad programming, don't use a regexp for something it hasn't been designed for.

Resources

To better understand what that is an automaton, how regexps are working, when is it a good idea to use and when it's a baby seal killing one, I can only advice you to look at the following courses:

Here's the visualization of @Andie2302's answer:

-\b(?:
     214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
     214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
     214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
     [1-9][0-9]{1,8}|[0-9]|-0
 )\b|
 \b(?:
     214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
     214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
     214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
     [1-9][0-9]{1,8}|[0-9]|-0
 )\b

through its matching automaton:

Still not convinced?

HTH

Instead of `\d|\d\d|\d\d\d|` etc., wouldn’t simply `\d{1,9}` achieve the same result? — Sebastian Simon, Mar 09 '15 at 00:14
indeed, but all in all, `\d{0,9}` is just hiding the states of the automaton, my point is to prove how massive is building an automaton for such a simple task. — zmo, Mar 09 '15 at 00:15
Your regex is wrong: https://regex101.com/r/pR2gS1/1. It fails to match 2147483549 or 1234567890. (I agree with the point in your answer, though) — nhahtdh, Mar 09 '15 at 05:43
OK, but let's not exaggerate too much. Flex (which does not do DFA minimization) produces a state machine with 89 states and about 2.5k of tables. (It could actually be less than 1k, but flex prefers using int16_t even when it could use uint8_t.) — rici, Mar 09 '15 at 06:35
@zmo: In DFA, `\d|\d\d|...` and `\d{0,9}` would produce the same number of states. However, depending on implementation of backtracking engine, the resulting states will be less when you write `\d{0,9}`, and there will also be less backtracking. — nhahtdh, Mar 09 '15 at 08:12

Andie2302 · Answer 2 · 2015-03-09T07:51:14.553

1

This regex should work:

-\b(?:214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b|\b(?:214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b

edited Mar 09 '15 at 07:51

answered Mar 09 '15 at 00:06

Andie2302

4,825
4
24
43

indeed, it might, but it only helps me to make my point, as the automaton you're generating is just as massive! Just remember that we're talking about converting a string into an int, check if it's within boundaries, which in any language is just as long as going through an array of characters. Just for the fun I'm adding a visualization of your regexp at the end of my answer. N.B.: I'm sure it's better than the one I built. – zmo Mar 09 '15 at 00:11
1

Why do you include `-0` in `[0-9]|-0)`? Shouldn't `[0-9]` already cover that? – nhahtdh Mar 09 '15 at 05:45
1

@nhahtdh You are right, the regex also can be written more compact since there are two parts and part one and two differ only in one character. – Andie2302 Mar 09 '15 at 07:56

is there a regex to generate all integers for a certain programming language

2 Answers2

Short answer

Long answer

Resources