
The re.DEBUG flag offers a peek at the inner workings of a regular expression pattern in Python, for example:

import re

re.compile(r"(a(?:b)){1,3}(c)", re.DEBUG)

Prints:

MAX_REPEAT 1 3
  SUBPATTERN 1 0 0
    LITERAL 97
    LITERAL 98
SUBPATTERN 2 0 0
  LITERAL 99

 0. INFO 4 0b0 3 7 (to 5)
 5: REPEAT 11 1 3 (to 17)
 9.   MARK 0
11.   LITERAL 0x61 ('a')
13.   LITERAL 0x62 ('b')
15.   MARK 1
17: MAX_UNTIL
18. MARK 2
20. LITERAL 0x63 ('c')
22. MARK 3
24. SUCCESS

Where can I find the meaning of the opcodes (SUBPATTERN, MAX_REPEAT, etc.)? Some of them are self-explanatory, but the overall picture is unclear. What does 1 0 0 mean in SUBPATTERN 1 0 0?

Some things I've tried:

Note: I know that perhaps this is not a perfect fit for a StackOverflow question, but I've written a clear problem with an MRE and my efforts at solving the issue at hand. Moreover, I think having this solved benefits the other users as well.

Dani Mesejo
    A related [thread](https://stackoverflow.com/questions/606350/how-can-i-debug-a-regular-expression-in-python), though none of the answers go in depth into this. – metatoaster Apr 11 '23 at 10:16

2 Answers


I think I can answer most of your question, but perhaps not all of it. The opcodes are part of the internals of Python's re module and don't seem to be meant to be user-facing.

That said,

MAX_REPEAT 1 3 indicates a repetition of a pattern 1 to 3 times.

LITERAL is just a character literal (e.g. ascii for 'a' is 97).
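Both readings can be checked against the pattern from the question. This is only a small sketch; the test strings are made up for illustration.

```python
import re

# Pattern from the question: group 1 ("ab") repeated 1 to 3 times, then "c".
pattern = re.compile(r"(a(?:b)){1,3}(c)")

# MAX_REPEAT 1 3: only inputs with one to three repetitions match in full.
candidates = ("c", "abc", "ababc", "abababc", "ababababc")
accepted = [s for s in candidates if pattern.fullmatch(s)]
print(accepted)

# LITERAL 97 / 98 / 99 are the code points of the matched characters.
literals = [chr(code) for code in (97, 98, 99)]
print(literals)
```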

SUBPATTERN 1 0 0 matches the 1st subpattern (i.e. capturing group), SUBPATTERN 2 0 0 matches the second subpattern, etc. The two numbers after the group index appear to be the inline flags that the group turns on and off (as in (?i:...)); both are 0 here because neither group sets any flags.
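To probe what the two numbers after the group index do, one can compare the debug output of a plain group with that of a group carrying an inline flag. The sketch below captures what re.DEBUG prints; debug_dump is a hypothetical helper name, and the exact output format is an implementation detail of CPython that may differ between versions.

```python
import contextlib
import io
import re

def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

plain = debug_dump(r"(b)")        # capturing group, no inline flags
flagged = debug_dump(r"(?i:b)")   # non-capturing group that turns on re.I

# Compare the first line of each dump: only the flagged group should show
# a non-zero value after the group index.
print(plain.splitlines()[0])
print(flagged.splitlines()[0])
```

If the second number changes for the flagged group, that suggests it records the flags the group adds, with the third number reserved for flags it removes.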

Anyway, if you are actually trying to debug some regex, I would recommend instead using one of the many nice online regex debuggers (e.g. https://regex101.com/).

Plonetheus

It looks like a lot of the logic is in the C module. I'm not too familiar with the language, so these are only basic findings, and they seem to apply mainly to the second part of the debug output (the part with the numbered lines).

Here are a few opcodes where comments in the source explain the output.

REPEAT/MAX_UNTIL/MIN_UNTIL:

<REPEAT> <skip> <1=min> <2=max> item <UNTIL/MIN_UNTIL/MAX_UNTIL> tail

INFO: Optimization info block. If SRE_INFO_PREFIX or SRE_INFO_CHARSET is in the flags, more follows.

<INFO> <1=skip> <2=flags> <3=min> <4=max> <5=prefix info>

LITERAL: Match literal string. This is used for short prefixes, and if fast search is disabled.

<LITERAL> <code>

MARK: Set a mark, likely for backtracking.

<MARK> <gid>
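Judging only by the dump in the question, the mark ids seem to bracket capturing groups: MARK 0 and MARK 1 surround group 1, MARK 2 and MARK 3 surround group 2. The sketch below encodes that reading; mark_to_group is a hypothetical helper, and the numbering is inferred from the output rather than taken from documentation.

```python
def mark_to_group(mark_id):
    """Map a MARK id to (group number, "start" or "end").

    Assumes even ids open a group and odd ids close it, as the
    debug output in the question suggests.
    """
    group = mark_id // 2 + 1
    edge = "start" if mark_id % 2 == 0 else "end"
    return group, edge

for mark_id in range(4):
    print(mark_id, mark_to_group(mark_id))
```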

All available opcodes if anyone wants to try to dig a little more:

SRE_OP_FAILURE
SRE_OP_SUCCESS
SRE_OP_ANY
SRE_OP_ANY_ALL
SRE_OP_ASSERT
SRE_OP_ASSERT_NOT
SRE_OP_AT
SRE_OP_BRANCH
SRE_OP_CALL
SRE_OP_CATEGORY
SRE_OP_CHARSET
SRE_OP_BIGCHARSET
SRE_OP_GROUPREF
SRE_OP_GROUPREF_EXISTS
SRE_OP_GROUPREF_IGNORE
SRE_OP_IN
SRE_OP_IN_IGNORE
SRE_OP_INFO
SRE_OP_JUMP
SRE_OP_LITERAL
SRE_OP_LITERAL_IGNORE
SRE_OP_MARK
SRE_OP_MAX_UNTIL
SRE_OP_MIN_UNTIL
SRE_OP_NOT_LITERAL
SRE_OP_NOT_LITERAL_IGNORE
SRE_OP_NEGATE
SRE_OP_RANGE
SRE_OP_REPEAT
SRE_OP_REPEAT_ONE
SRE_OP_SUBPATTERN
SRE_OP_MIN_REPEAT_ONE
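The opcode names can also be listed from Python itself instead of reading the C headers. This is a sketch that relies on private modules (re._constants on Python 3.11+, sre_constants before that), so the module names and the resulting list are assumptions that may change between versions.

```python
try:
    # Python 3.11 and later keep the constants in a private submodule of re.
    from re import _constants as sre_constants
except ImportError:
    # Older versions expose them as a top-level module.
    import sre_constants

# Each opcode is an int subclass that carries its name.
opcode_names = [op.name for op in sre_constants.OPCODES]
print(opcode_names)
```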
Peter