I'm trying to create a compiler in python and I'm using the re module to create tokens. The language will be very similar to Assembly.

Almost everything is working, but I'm having trouble with a token. Let me give an example of what would be this token:

mov [eax], 4
mov [name],2
mov eax, [ebx]

Tokens: [eax], [ebx]

I can find what I want using this pattern: \[(eax|ebx)\] But I get an error when I combine it with the other patterns. I believe it is because of the '|'.

SCANNER = re.compile(r"""
    ;(.)*                    # comment
    |(\[-?[0-9]+\])          # memory_int
    |(\[-?0x[0-9a-fA-F]+\])      # memory_hex
    |(\[(eax|ebx)\])             # memory access with registers
    """, re.VERBOSE)

for match in re.finditer(SCANNER, lines[i]):
    comment, memory_int, memory_hex, memory_reg = match.groups()

Error:

ValueError: too many values to unpack (expected 4)

Is there any way to replace the '|' with another character?

Batata
  • As a non-answer comment: I heartily recommend passing flags explicitly, as in `re.foo(..., flags=X)`, rather than positionally. Not all `re` functions take flags as the third argument, so typing `flags=X` out of habit will save you from a big headache some night. – FirefighterBlu3 Apr 29 '15 at 02:05

3 Answers

Your heartache is caused by a capturing group nested inside another capturing group, which makes each match's groups() call return a 5-tuple instead of the 4 values you are unpacking. Instead of the inner capturing group, use a non-capturing group (syntax: (?:pattern)) inside your final capturing group as follows:

(\[(?:eax|ebx)\])

Example run:

>>> import re
>>> SCANNER = re.compile(r';(.)*|(\[-?[0-9]+\])|(\[-?0x[0-9a-fA-F]+\])|(\[(?:eax|ebx)\])')
>>> next(re.finditer(SCANNER, 'mov eax, [ebx]')).groups()
(None, None, None, '[ebx]')
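
To see the difference in group counts directly, a compiled pattern exposes a `groups` attribute holding the number of capturing groups. The following is a minimal sketch; the names `NESTED` and `FLAT` are only illustrative labels for the two variants of the pattern.

```python
import re

# Original pattern: the register alternation is itself a capturing group.
NESTED = re.compile(r";(.)*|(\[-?[0-9]+\])|(\[-?0x[0-9a-fA-F]+\])|(\[(eax|ebx)\])")
# Same pattern with the inner group made non-capturing.
FLAT = re.compile(r";(.)*|(\[-?[0-9]+\])|(\[-?0x[0-9a-fA-F]+\])|(\[(?:eax|ebx)\])")

# Number of capturing groups in each compiled pattern.
print(NESTED.groups, FLAT.groups)

# groups() returns one entry per capturing group, matched or not.
nested_groups = NESTED.search("mov eax, [ebx]").groups()
flat_groups = FLAT.search("mov eax, [ebx]").groups()
print(nested_groups)
print(flat_groups)
```

Checking `pattern.groups` against the number of names on the left-hand side of the unpacking is a quick way to catch this kind of mismatch before running the scanner on real input.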
Shashank

My suggestion here would be to just ignore the value when unpacking.

comment, memory_int, memory_hex, memory_reg, _ = match.groups()

Or:

comment, memory_int, memory_hex, memory_reg = match.groups()[:4]
meiamsome

The problem isn't because of the | characters in:

    |(\[(eax|ebx)\])             # memory access with registers

It's because that part of the expression defines two capturing groups, one nested inside the other, so match.groups() returns more values than can be unpacked, such as this for the first line:

(None, None, None, '[eax]', 'eax')

One way to avoid the nested group would be to instead use:

    |(\[eax\]|\[ebx\])          # memory access with registers

which would result in this being returned:

(None, None, None, '[eax]')

As @Shashank pointed out, you could also use non-capturing group (?:...) syntax to define the nested possible register value patterns:

    |(\[(?:eax|ebx)\])          # memory access with registers

to achieve the same thing. That approach is advantageous when there are many possible sub-patterns, or when they are more complicated, because otherwise you would need to spell out the entire pattern for each possibility instead of factoring out what they have in common.
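
When the list of registers grows, the alternation inside the non-capturing group can be built from a list instead of being typed by hand. The sketch below makes several assumptions that are not in the question: the `REGISTERS` list beyond eax and ebx, the `tokenize` helper, and the use of named groups with `match.lastgroup` so that nothing depends on group positions. The comment group is also rewritten as `;.*` so that it captures the whole comment, which differs from the question's `;(.)*`.

```python
import re

# Hypothetical register list; the question only mentions eax and ebx.
REGISTERS = ["eax", "ebx", "ecx", "edx"]

# re.escape guards against names that contain regex metacharacters.
register_alternation = "|".join(re.escape(name) for name in REGISTERS)

SCANNER = re.compile(rf"""
    (?P<comment>;.*)                                  # comment
    |(?P<memory_int>\[-?[0-9]+\])                     # memory_int
    |(?P<memory_hex>\[-?0x[0-9a-fA-F]+\])             # memory_hex
    |(?P<memory_reg>\[(?:{register_alternation})\])   # memory access with registers
    """, re.VERBOSE)


def tokenize(line):
    """Return (kind, text) pairs for each token found in line."""
    # lastgroup is the name of the top-level group that matched,
    # because no capturing groups are nested inside the named ones.
    return [(m.lastgroup, m.group()) for m in SCANNER.finditer(line)]


tokens = tokenize("mov eax, [ecx] ; load")
print(tokens)
```

With named groups, adding another alternative to the scanner does not require touching any unpacking code, at the cost of a slightly longer pattern.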

martineau