I want to build a tokenizer that allows me to pass in patterns.

As I understand it, in a normal alternation (OR group) the first alternative that matches wins.

This pattern:

(?<integer>[0-9]+)|(?<float>[0-9]+[.][0-9]+)|(?<invalid>[^\s]+)

This would never match the float group, since the integer group always matches first. The behaviour I want is for the first two groups to match as greedily as they can and the last group to match as non-greedily as it can.

2.2BLA3.1 should be matched as float(2.2), invalid(BLA), float(3.1)
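To make the first-match-wins behaviour concrete, here is a minimal Python sketch of the pattern above. Python is an assumption here (the question does not name a language), and Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`; the helper name `naive_tokens` is made up for illustration.

```python
import re

# The question's pattern, with Python's (?P<name>...) named-group syntax.
NAIVE = re.compile(
    r"(?P<integer>[0-9]+)|(?P<float>[0-9]+[.][0-9]+)|(?P<invalid>[^\s]+)"
)

def naive_tokens(text):
    # m.lastgroup is the name of the alternative that matched.
    return [(m.lastgroup, m.group()) for m in NAIVE.finditer(text)]

print(naive_tokens("2.2BLA3.1"))
# [('integer', '2'), ('invalid', '.2BLA3.1')]
```

The integer alternative succeeds on the leading 2, so the float alternative is never tried at that position, and the greedy invalid group then swallows the rest.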

My use case does not allow me to give the tokens a fixed ordering, so I have to solve this by adding control characters to the regular expression.

What needs to be added?


EDIT:

There have been great suggestions so far, thank you. One suggestion is to change the ordering. Unfortunately, my use case does not allow me to give the tokens a fixed ordering, so I cannot predict the order in which I am given the group information.

Another very interesting one is to make the integer group more restrictive. This does not fit the use case either. I did not mention this before, but essentially I get a list of (token name, token pattern) tuples and have to fit them into one big pattern.

(?<integer>[0-9]+)|(?<float>[0-9]+[.][0-9]+)|(?<invalid>[^\s]+)

This pattern can be the result of reworking a list like

{
    {"integer","[0-9]+"},
    {"float","[0-9]+[.][0-9]+"}
}

When I asked what needs to be added, I was hoping to use some control sequence to change the behaviour of the groups themselves.
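For what it is worth, an order-independent result can also be had without a single combined alternation. The sketch below is not the control sequence asked for; it swaps in a different technique (longest match, sometimes called maximal munch): every pattern is tried at the current position and the longest match wins. It is written in Python as an assumption, and `tokenize_longest` and its `fallback` parameter are hypothetical names.

```python
import re

def tokenize_longest(text, rules, fallback="invalid"):
    """Try every rule at each position and keep the longest match.

    Characters that no rule matches are collected into `fallback` tokens,
    so the fallback only ever takes what nothing else wants.
    """
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []  # entries are (name, value, end_offset)
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            # No rule matched: take one character, extending an adjacent
            # fallback token if there is one.
            if tokens and tokens[-1][0] == fallback and tokens[-1][2] == pos:
                tokens[-1] = (fallback, tokens[-1][1] + text[pos], pos + 1)
            else:
                tokens.append((fallback, text[pos], pos + 1))
            pos += 1
        else:
            tokens.append((best_name, text[pos:best_end], best_end))
            pos = best_end
    return [(name, value) for name, value, _ in tokens]

rules = [("integer", "[0-9]+"), ("float", "[0-9]+[.][0-9]+")]
print(tokenize_longest("2.2BLA3.1", rules))
# [('float', '2.2'), ('invalid', 'BLA'), ('float', '3.1')]
```

Reversing `rules` gives the same output, because the winner is chosen by match length rather than by position in the list.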

Johannes
  • Duplicate of [Why won't a longer token in an alternation be matched?](http://stackoverflow.com/q/25511528/3622940) – Unihedron Sep 18 '14 at 14:11

3 Answers

(?<integer>(?:[0-9](?!\d*\.))+)|(?<float>[0-9]+[.][0-9]+)|(?<invalid>[^\s]+)

You can try this. See the demo:

http://regex101.com/r/bZ8aY1/2
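A quick way to check this pattern outside regex101 is a small Python sketch (Python is an assumption, so the named groups are written `(?P<name>...)`; `lookahead_tokens` is a made-up helper name). The per-digit lookahead `(?!\d*\.)` rejects any digit that is followed by more digits and then a dot, so the integer alternative fails on input like 12.5 and the float alternative gets its turn.

```python
import re

# The answer's pattern, with Python's (?P<name>...) named-group syntax.
LOOKAHEAD = re.compile(
    r"(?P<integer>(?:[0-9](?!\d*\.))+)"
    r"|(?P<float>[0-9]+[.][0-9]+)"
    r"|(?P<invalid>[^\s]+)"
)

def lookahead_tokens(text):
    return [(m.lastgroup, m.group()) for m in LOOKAHEAD.finditer(text)]

print(lookahead_tokens("12 12.5"))
# [('integer', '12'), ('float', '12.5')]
print(lookahead_tokens("2.2BLA3.1"))
# [('float', '2.2'), ('invalid', 'BLA3.1')]
```

The second call shows that the invalid group is still greedy here: it takes BLA3.1 in one piece, so the trailing 3.1 is not reported as a float.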

vks

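If you append (?![.]) to the definition of integer (that is a zero-width negative lookahead that matches only if there is no dot after the current position), it should work for input like 2.2. With several digits before the dot, as in 12.5, the engine can backtrack to a shorter run of digits that is not directly followed by a dot, so (?![.0-9]) is the safer lookahead. Otherwise, you could try to switch <float> and <integer>.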
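The lookahead idea can be tried out with a short Python sketch (Python is an assumption, hence the `(?P<name>...)` group syntax; `build` and `tokens_with` are hypothetical helpers). It compares the suggested `(?![.])` with a variant, `(?![.0-9])`, that also forbids a following digit; the variant is an addition for comparison, not part of the answer.

```python
import re

def build(integer_pattern):
    # Combine a given integer pattern with the question's other two groups.
    return re.compile(
        rf"(?P<integer>{integer_pattern})"
        r"|(?P<float>[0-9]+[.][0-9]+)"
        r"|(?P<invalid>[^\s]+)"
    )

def tokens_with(regex, text):
    return [(m.lastgroup, m.group()) for m in regex.finditer(text)]

dot_only = build(r"[0-9]+(?![.])")          # lookahead as suggested
dot_or_digit = build(r"[0-9]+(?![.0-9])")   # variant, added for comparison

print(tokens_with(dot_only, "2.2"))        # [('float', '2.2')]
print(tokens_with(dot_only, "12.5"))       # [('integer', '1'), ('float', '2.5')]
print(tokens_with(dot_or_digit, "12.5"))   # [('float', '12.5')]
print(tokens_with(dot_or_digit, "12"))     # [('integer', '12')]
```

With `(?![.])` alone, `[0-9]+` backtracks from 12 to 1 on the input 12.5, because 1 is followed by a digit rather than a dot; forbidding a following digit as well closes that gap.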

llogiq

A float starts the same way as an integer but has the stricter pattern, so it should be safe to look for a float before an integer. This way, if it can match a float at all then it will; if it can't, it will look for a plain integer instead:

(?<float>[0-9]+[.][0-9]+)|(?<integer>[0-9]+)|(?<invalid>[^\s]+)

Then, to make the last group (invalid) as non-greedy as possible, you can use the lazy quantifier +? (although it's worth noting that invalid will then match one character at a time, so BLA comes back as three separate invalid matches):

(?<float>[0-9]+[.][0-9]+)|(?<integer>[0-9]+)|(?<invalid>[^\s]+?)
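If single-character invalid matches are inconvenient, adjacent ones can be glued back together after matching. The following is a minimal Python sketch under that assumption (Python's named-group syntax is `(?P<name>...)`; `raw_tokens` and `merged_tokens` are hypothetical helper names, not part of any library).

```python
import re

# Float first, then integer, then a lazy invalid group.
LAZY = re.compile(
    r"(?P<float>[0-9]+[.][0-9]+)|(?P<integer>[0-9]+)|(?P<invalid>[^\s]+?)"
)

def raw_tokens(text):
    return [(m.lastgroup, m.group()) for m in LAZY.finditer(text)]

def merged_tokens(text):
    merged = []  # entries are [name, value, end_offset]
    for m in LAZY.finditer(text):
        name = m.lastgroup
        prev = merged[-1] if merged else None
        # Glue an invalid match onto the previous one only if they touch.
        if prev and name == "invalid" and prev[0] == "invalid" and prev[2] == m.start():
            prev[1] += m.group()
            prev[2] = m.end()
        else:
            merged.append([name, m.group(), m.end()])
    return [(name, value) for name, value, _ in merged]

print(raw_tokens("2.2BLA3.1"))
# [('float', '2.2'), ('invalid', 'B'), ('invalid', 'L'), ('invalid', 'A'), ('float', '3.1')]
print(merged_tokens("2.2BLA3.1"))
# [('float', '2.2'), ('invalid', 'BLA'), ('float', '3.1')]
```

Checking the offsets keeps invalid runs that are separated by whitespace as separate tokens.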

It's also worth mentioning that .75 is technically a valid floating-point value, so you may want to make the integer part of the float pattern optional:

(?<float>[0-9]*[.][0-9]+)|(?<integer>[0-9]+)|(?<invalid>[^\s]+?)
Joe
  • Yes, but my use case does not allow me to give the tokens a fixed ordering, so I have no control over whether `integer` ends up before `float`. – Johannes Sep 18 '14 at 13:01
  • I'm not sure I follow you. It doesn't matter what order they appear in the data being matched against; just by trying to match a float before an integer you solve the problem. All floats start with an integer value, but not all integers start with a floating point value. – Joe Sep 18 '14 at 13:04
  • The parts are passed into the function that creates the combination. All I could do is sort them by length, but that is guessing. – Johannes Sep 18 '14 at 14:03