3

The intro is a bit lengthy, so please bear with me. :)

I am writing a simple regex-based parser for a large source file written in assembler. Most of these instructions are just moving, adding, subtracting and jumping around, but it's a pretty large file which I need to port to two different languages and I am too lazy to do it manually. That's the requirement and I can't do much 'bout it (so please don't answer stuff like "why don't you simply use ANTLR").

So, after I do some preprocessing (I already did this part: replaced defines and macros and stripped redundant whitespace and comments), I now basically have to read the file line by line and parse one or potentially more lines into "intermediate" instructions, which I will then use to generate more or less 1-to-1 equivalent (using actual integer arithmetics and a bunch of GOTOs).

So, presuming that I can have all these different addressing modes:

Addressing mode depends on the format of the instruction

I can go two different ways:

  1. Have a single MOV regex which will handle all these cases, or
  2. Have multiple MOV regexes, on for each instruction type. The problem with this approach is that I would have to design each regex pretty carefully to avoid any ambiguity. AND it seems there would be lots of duplicates since source and destination operands share many of the addressing modes.

My question is: If I have a single regex for all instructions, how should I specify my groups and captures to be able to simply differentiate between different modes?

Or do I simply catch everything and then process the source/destination address after the initial match?

E.g. a rather simple match-all regex would be:

^MOV\s+(?<dest>[^\s,]+)[\s,]*(?<src>[^\s,]+)$

(Split into multiple lines with comments):

^MOV              (?#instruction)
\s+               (?#some whitespace)
(?<dest>[^\s,]+)  (?#match everything except whitespace and comma)
\s*,\s*           (?#match comma, allow some whitespace)
(?<src>[^\s,]+)   (?#match everything except whitespace and comma)$

So, I can certainly do this and then process dest and src groups separately. But would it be better to create a nasty complex regex to match all cases from the table below? In that case I am not sure how I would interpret these captures to understand what addressing mode was matched.

I am using C#, if that makes any difference.

us2012
  • 16,083
  • 3
  • 46
  • 62
Lou
  • 4,244
  • 3
  • 33
  • 72
  • 1
    don't try always solving such a problem with **just pure Regex**, in fact we need more **preprocessing and postprocessing** besides the output Regex can bring. – King King Sep 29 '13 at 10:39
  • Why not just parse it from top to bottom? .. why the regex need? – Simon Whitehead Sep 29 '13 at 10:58
  • Hm, your comments make sense, I am probably overthinking this. The thing is, I already have a bunch of regexes for other instructions which were quicker for me to write and test than iterating through characters manually, but for these instructions with multiple variants (like MOV), it's probably best to simply match the opcode and then parse the rest of it using a couple of if-clauses. – Lou Sep 29 '13 at 11:07
  • Need to port to *what* other two languages? Two other assemblers? Similar instruction sets? Similar syntax? What does it mean to "port" the individual instructions? – Ira Baxter Sep 29 '13 at 18:17

2 Answers2

1

Here's a regex that does pretty much what you want (you'll have to edit for the actual data forms; i.e. instead of all the register labels ax, bx, ... I just used 'reg', etc.)

 (?<Opt1>MOV\s*Rw,\sRw)
|(?<Opt2>MOV\s*Rw,\s\#data4)
|(?<Opt3>MOV\s*Rw,\s\#data16)
|(?<Opt4>MOV\s*Rw,\s\[Rw\])
|(?<Opt5>MOV\s*Rw,\s\[Rw\+\])
|(?<Opt6>MOV\s*\[Rw\],\sRw)
|(?<Opt7>MOV\s*\[-Rw\],\sRw)
|(?<Opt8>MOV\s*\[Rw\],\s\[Rw\])
|(?<Opt9>MOV\s*\[Rw\+\],\s\[Rw\])
|(?<OptA>MOV\s*\[Rw\],\s\[Rw\+\]) 

using this data:

MOV Rw, Rw
MOV Rw, #data4
MOV Rw, #data16
MOV Rw, [Rw]
MOV Rw, [Rw+]
MOV [Rw], Rw
MOV [-Rw], Rw
MOV [Rw], [Rw]
MOV [Rw+], [Rw]
MOV [Rw], [Rw+]

RegexBuddy generates this:

Match 1:    MOV Rw, Rw       0      10
Group "Opt1":   MOV Rw, Rw       0      10
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 2:    MOV Rw, #data4      12      14
Group "Opt1" did not participate in the match
Group "Opt2":   MOV Rw, #data4      12      14
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 3:    MOV Rw, #data16     28      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3":   MOV Rw, #data16     28      15
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 4:    MOV Rw, [Rw]        45      12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4":   MOV Rw, [Rw]        45      12
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 5:    MOV Rw, [Rw+]       59      13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5":   MOV Rw, [Rw+]       59      13
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 6:    MOV [Rw], Rw        74      12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6":   MOV [Rw], Rw        74      12
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 7:    MOV [-Rw], Rw       88      13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7":   MOV [-Rw], Rw       88      13
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 8:    MOV [Rw], [Rw]     103      14
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8":   MOV [Rw], [Rw]     103      14
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 9:    MOV [Rw+], [Rw]    119      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9":   MOV [Rw+], [Rw]    119      15
Group "OptA" did not participate in the match
Match 10:   MOV [Rw], [Rw+]    136      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA":   MOV [Rw], [Rw+]    136      15
Dweeberly
  • 4,668
  • 2
  • 22
  • 41
1

You are discovering what happens when you attempt to bring a lexer to a parser's job. Much of your difficulty I think is in trying to do too much with the regexes.

And yes, I'm going to suggest a parser like ANTLR or equivalent.

If you went that route, you'd write a whole lot of little regexps to identify tokens ("MOV", "#", "[", ...) and then you'd write a grammar that defined how these compose into instructions. If nothing else, this makes it a lot easier to simply write the parsing part.

You can see what an assembler code this looks like. (Uses a system other than ANTLR, but the ideas are the same). This was pretty straightforward to write, and there was no agony about trying to write the One Regex to Rule them All. [I did that example in an evening, and used it parse a rather large set of sources].

You were unclear on what "port" meant. Presumably you are going to another assembler syntax, if not another machine architecture. To do that well, you'll need access to various instruction parts (which a single regex for all possible MOV instructions won't give you). Here is the beauty of parsing and producing trees: all those parts are exposed to you, embedded in the structure in which they belong. You can even generate single instructions from multiple assembly language statements, because the tree holds the entire program. (Rather large doesn't mean much in terms of tree size on systems with a gigabyte of RAM).

Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • Ok, I guess you're right. There is no "right" way to do it the way I am doing it now, except for going the other direction. Never mind, the instruction set is pretty limited so I've already hand-parsed the operand types after matching the binary instruction. – Lou Oct 02 '13 at 06:59
  • I'd say ANTLR is overkill for this problem. Your first approach looks fine to me. – Michael Dyck Oct 02 '13 at 18:59