I want to build a tokenizer that allows me to pass in patterns.
As I understand it, in a normal alternation (OR group) the first alternative that matches wins.
This pattern:
(?<integer>[0-9]+)|(?<float>[0-9]+[.][0-9]+)|(?<invalid>[^\s]+)
would never match the float group, since the integer group would always be matched first.
The behaviour I want is that the first two groups match as greedily as they can and the last group matches as lazily (non-greedily) as it can. For example, the input
2.2BLA3.1
should be matched as float(2.2), invalid(BLA), float(3.1)
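A minimal reproduction of what the pattern does today, sketched in Python purely for illustration. Python writes named groups as (?P<name>...) instead of (?<name>...), which is the only change to the pattern; the helper name tokenize is hypothetical.

```python
import re

# Same alternation as in the question; only the named-group syntax is
# adapted to Python, which spells it (?P<name>...).
PATTERN = re.compile(
    r"(?P<integer>[0-9]+)|(?P<float>[0-9]+[.][0-9]+)|(?P<invalid>[^\s]+)"
)


def tokenize(text):
    """Return (group name, matched text) pairs in the order they are found."""
    # m.lastgroup is the name of the alternative that produced the match.
    return [(m.lastgroup, m.group()) for m in PATTERN.finditer(text)]


actual = tokenize("2.2BLA3.1")
print(actual)  # [('integer', '2'), ('invalid', '.2BLA3.1')]
```

The integer alternative wins on the leading 2, and the greedy invalid group then swallows the rest, so the float group is never used.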
My use case does not allow me to give the tokens a fixed ordering, so I have to solve this by adding additional control characters to the regular expression.
What needs to be added?
EDIT:
There have been great suggestions so far, thank you. One suggestion is to change the ordering. Unfortunately, my use case does not allow me to give the tokens a fixed ordering, so I cannot predict the order in which I am given the group information.
Another very interesting one is to make the integer group more restrictive. This will also not fit the use case. I did not mention this before, but essentially I get a list of (tokenname, tokenpattern) tuples and I have to fit them into one big pattern.
(?<integer>[0-9]+)|(?<float>[0-9]+[.][0-9]+)|(?<invalid>[^\s]+)
This pattern can be the result of reworking a list like:
{
{"integer","[0-9]+"},
{"float","[0-9]+[.][0-9]+"}
}
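The reworking step could look roughly like the sketch below. Python is used here for illustration only and writes named groups as (?P<name>...). The names token_specs and build_pattern are hypothetical, and appending the invalid catch-all as the last alternative is an assumption based on the combined pattern shown above.

```python
import re

# Hypothetical input: (token name, token pattern) pairs, in whatever
# order they happen to arrive.
token_specs = [
    ("integer", "[0-9]+"),
    ("float", "[0-9]+[.][0-9]+"),
]


def build_pattern(specs):
    """Wrap each pattern in a named group and join the groups with '|'."""
    groups = [f"(?P<{name}>{pattern})" for name, pattern in specs]
    # Assumption: the catch-all group for invalid input is always added last.
    groups.append(r"(?P<invalid>[^\s]+)")
    return "|".join(groups)


combined = build_pattern(token_specs)
compiled = re.compile(combined)
print(combined)
```

Because the alternatives are joined in arrival order, the resulting pattern inherits whatever ordering the list had, which is exactly why I cannot rely on reordering.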
When I asked what needs to be added, I was hoping for some control sequence that changes the behaviour of the groups themselves.