Clean solution
There is a clean solution to this problem: hoist the regexes out of the case clauses, where they aren't supported, and into the match clause (the subject expression), which accepts any Python object.
A combined regex is also more efficient than a series of separate regex tests, and it can be precompiled once so that no compilation work is left for the match process.
Example
Here's a worked-out example for a simple tokenizer:
import re
from collections import namedtuple

# One capturing group per token kind; order matters (float before int).
pattern = re.compile(r'(\d+\.\d+)|(\d+)|(\w+)|(".*")')
Token = namedtuple('Token', ('kind', 'value', 'position'))
env = {'x': 'hello', 'y': 10}

for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    # lastindex is the number of the group that matched.
    match mo.lastindex:
        case 1:
            tok = Token('NUM', float(s), mo.span())
        case 2:
            tok = Token('NUM', int(s), mo.span())
        case 3:
            tok = Token('VAR', env[s], mo.span())
        case 4:
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
This outputs:
Token(kind='NUM', value=123, position=(0, 3))
Token(kind='NUM', value=123.45, position=(0, 6))
Token(kind='VAR', value='hello', position=(0, 1))
Token(kind='VAR', value=10, position=(0, 1))
Token(kind='TEXT', value='goodbye', position=(0, 9))
Better Example
The code can be improved by writing the combined regex in verbose format, which makes it easier to read and easier to extend with more cases. It can be further improved by naming the regex subpatterns:
pattern = re.compile(r"""(?x)
    (?P<float>\d+\.\d+) |
    (?P<int>\d+) |
    (?P<variable>\w+) |
    (?P<string>".*")
""")
That can be used in a match/case statement like this:
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    match mo.lastgroup:
        case 'float':
            tok = Token('NUM', float(s), mo.span())
        case 'int':
            tok = Token('NUM', int(s), mo.span())
        case 'variable':
            tok = Token('VAR', env[s], mo.span())
        case 'string':
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
Comparison to if/elif/else
Here is the equivalent code written using an if-elif-else chain:
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    # Raw strings avoid invalid-escape warnings for \d and \w.
    if (mo := re.fullmatch(r'\d+\.\d+', s)):
        tok = Token('NUM', float(s), mo.span())
    elif (mo := re.fullmatch(r'\d+', s)):
        tok = Token('NUM', int(s), mo.span())
    elif (mo := re.fullmatch(r'\w+', s)):
        tok = Token('VAR', env[s], mo.span())
    elif (mo := re.fullmatch(r'".*"', s)):
        tok = Token('TEXT', s[1:-1], mo.span())
    else:
        raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
Compared to the match/case version, the if-elif-else chain is slower because it can run up to four separate regex matches per input, and because each re.fullmatch() call has to look its pattern up in the re module's cache rather than use a precompiled pattern object. It is also less maintainable without the case names.
Because all the regexes are separate, we have to capture each match object separately with repeated assignment expressions (the walrus operator). This is awkward compared to the match/case example, where we make only a single assignment.
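If you want to check the speed difference on your own machine, a rough timing sketch along these lines may help. The function names kind_combined and kind_chain are made up for this illustration, and no particular numbers are claimed here since timings depend on the Python version and the inputs.

```python
import re
import timeit

# Single precompiled regex with one named group per token kind.
COMBINED = re.compile(
    r'(?P<float>\d+\.\d+)|(?P<int>\d+)|(?P<variable>\w+)|(?P<string>".*")'
)

def kind_combined(s):
    """Classify with one match against the combined regex."""
    mo = COMBINED.fullmatch(s)
    return mo.lastgroup if mo else None

def kind_chain(s):
    """Classify with a chain of separate regex tests."""
    if re.fullmatch(r'\d+\.\d+', s):
        return 'float'
    elif re.fullmatch(r'\d+', s):
        return 'int'
    elif re.fullmatch(r'\w+', s):
        return 'variable'
    elif re.fullmatch(r'".*"', s):
        return 'string'
    return None

samples = ['123', '123.45', 'x', 'y', '"goodbye"']

# Time each approach over the same inputs and print the elapsed seconds.
for fn in (kind_combined, kind_chain):
    elapsed = timeit.timeit(lambda: [fn(s) for s in samples], number=10_000)
    print(f'{fn.__name__}: {elapsed:.3f}s')
```

Both functions classify the sample inputs identically, so any difference in the printed times reflects the matching strategy rather than the results.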