2

let's say I'm building a compiler and I want the lexical analyzer to recognize integers of the C language, can I specify for example that the integer should be between –2,147,483,648 and 2,147,483,647 that a long integer can be 64 bits? I feel that my question is stupid but I want to know if it's doable... thanks

sam smith
  • 59
  • 2

2 Answers2

5

Short answer

Yes that can be done, but you should not do that!

Spoiler alert: you should better be using strtol, and I'm telling you why in the long answer.

Long answer

it can be done using a weirdly crafted regexp (the worst one being a regexp with the list of all integers between MIN and MAX), but you do not want to do such a thing.

This is because such a task would mean a massive processing for the regexp, whereas that test can be done in a very little processing in your favorite language (consider the following as pseudocode):

if (str_to_int(s) > CMIN && str_to_int(s) < CMAX)

Well, actually you might tell me "but if it's an int, it will overflow!". But there are technics to detect that:

and none of them is using a regex!

But anyway, you don't need to go into so much trouble, when there's a function already baked in the C standard library that does that job for you: the strtol function! Quoting the manual:

The strtol() function returns the result of the conversion, unless the value would underflow or overflow. If an underflow occurs, strtol() returns LONG_MIN. If an overflow occurs, strtol() returns LONG_MAX. In both cases, errno is set to ERANGE. Precisely the same holds for strtoll() (with LLONG_MIN and LLONG_MAX instead of LONG_MIN and LONG_MAX).

Why would it be massive? It's because a regexp is an automaton looking at a stream of characters. When there's a match, you move along the automaton. Basically, you'd need to:

  • match any string of 10 characters, or 11 only if it starts with a -
  • that contains only digits,
  • that if it starts with a 2, can only be followed by 0 or 1,
  • that if it starts with a 2, followed by a 1, can only be followed by 0, 1, 2, 3 or 4
  • that if it starts with a 2, followed by a 1 and then a 4, can only be followed by a 1, 2, 3, 47
  • if it starts with a 2, folowed by … and ends with a 7, but if it started with a -, and then a 2, it needs to end with a 6 (so basically you have to copy all the previous conditions into another subgraph that ends with that)
  • and for any other character it's a match.

That would look a bit like the following:

^(
  (
   \d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
   \d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
   [0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-8]
  )|
  -(
    \d|\d\d|\d\d\d|\d\d\d\d|\d\d\d\d\d|\d\d\d\d\d\d|
    \d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d\d|
    [0-2][0-1][0-4][0-7][0-4][0-8][0-3][0-6][0-4][0-7]
   )
 )$

which is represented visually by the following automaton (click on the image to play with it):

Regular expression visualization

I'm not sure how correct that would be, because I might have missed edge cases, but I hope I made it clear how it compares with doing it in your favorite language. If you actually parse such a huge automaton, it will:

  • burn CPU time,
  • burning electricity,
  • burning (fuel|coal|gaz|uranium),
  • polluting the planet,
  • killing a baby seal

all that instead of doing something that can be done in an operation being 1/100th of the complexity of doing the same thing using a regexp.

cute baby seal

So if you don't want to kill a baby seal because of bad programming, don't use a regexp for something it hasn't been designed for.


Resources

To better understand what that is an automaton, how regexps are working, when is it a good idea to use and when it's a baby seal killing one, I can only advice you to look at the following courses:


Here's the visualization of @Andie2302's answer:

-\b(?:
     214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
     214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
     214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
     [1-9][0-9]{1,8}|[0-9]|-0
 )\b|
 \b(?:
     214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|
     214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|
     214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|
     [1-9][0-9]{1,8}|[0-9]|-0
 )\b

through its matching automaton:

Regular expression visualization, click on me to play with me!

Still not convinced?

HTH

Community
  • 1
  • 1
zmo
  • 24,463
  • 4
  • 54
  • 90
  • 1
    Instead of `\d|\d\d|\d\d\d|` etc., wouldn’t simply `\d{1,9}` achieve the same result? – Sebastian Simon Mar 09 '15 at 00:14
  • indeed, but all in all, `\d{0,9}` is just hiding the states of the automaton, my point is to prove how massive is building an automaton for such a simple task. – zmo Mar 09 '15 at 00:15
  • Your regex is wrong: https://regex101.com/r/pR2gS1/1. It fails to match 2147483549 or 1234567890. (I agree with the point in your answer, though) – nhahtdh Mar 09 '15 at 05:43
  • 1
    OK, but let's not exaggerate too much. Flex (which does not do DFA minimization) produces a state machine with 89 states and about 2.5k of tables. (It could actually be less than 1k, but flex prefers using int16_t even when it could use uint8_t.) – rici Mar 09 '15 at 06:35
  • @zmo: In DFA, `\d|\d\d|...` and `\d{0,9}` would produce the same number of states. However, depending on implementation of backtracking engine, the resulting states will be less when you write `\d{0,9}`, and there will also be less backtracking. – nhahtdh Mar 09 '15 at 08:12
1

This regex should work:

-\b(?:214748364[0-8]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b|\b(?:214748364[0-7]|21474836[0-3][0-9]|2147483[0-5][0-9]{2}|214748[0-2][0-9]{3}|21474[0-7][0-9]{4}|2147[0-3][0-9]{5}|214[0-6][0-9]{6}|21[0-3][0-9]{7}|20[0-9]{8}|1[0-9]{9}|[1-9][0-9]{1,8}|[0-9])\b
Andie2302
  • 4,825
  • 4
  • 24
  • 43
  • indeed, it might, but it only helps me to make my point, as the automaton you're generating is just as massive! Just remember that we're talking about converting a string into an int, check if it's within boundaries, which in any language is just as long as going through an array of characters. Just for the fun I'm adding a visualization of your regexp at the end of my answer. N.B.: I'm sure it's better than the one I built. – zmo Mar 09 '15 at 00:11
  • 1
    Why do you include `-0` in `[0-9]|-0)`? Shouldn't `[0-9]` already cover that? – nhahtdh Mar 09 '15 at 05:45
  • 1
    @nhahtdh You are right, the regex also can be written more compact since there are two parts and part one and two differ only in one character. – Andie2302 Mar 09 '15 at 07:56