Assembly language parser implementation

Question

So, I'm continuing my wandering around, and I'm pretty sure I've ended up needing some open-source lexeme analyzer for assembler commands (some TinyPG implementation, maybe).

All I want to know is how I can make my app recognize that a given text might be assembler code. For example,

mov ah, 37

should be accepted, while

bad my 42

should not.

Advice on implementing this myself is welcome too, of course, because I'm not sure I would understand "hardcore" implementations.

Assembly language is quite different for different architectures. Also within the same architecture, different assemblers might have slightly different directives for stuff other than the actual instructions. Your example is x86 so is it only x86 you want to be able to detect? — Anders Abel, Jun 02 '13 at 14:10
Not the whole x86 family (not that it would make much difference). I'm planning an i8086 command analyzer. — user2380317, Jun 02 '13 at 14:21
Modern machine architectures are pretty complex, and offer a wide variety of instructions. Traditionally assemblers have had hand-coded parsers, but that gets harder over time. See my SO answer for assembler styles using real parser generators: http://stackoverflow.com/a/1317779/120163 — Ira Baxter, Jun 02 '13 at 15:43

score 3 · Answer 1 · answered Jun 02 '13 at 16:00

The best way to check if some text might be in some language is to try and parse it - embed the assembler in your application and invoke it. I strongly recommend that approach - even for assembly code the input can contain some special syntax or construction that you haven't thought of and you'll end up emitting a false negative.

This is especially true with assembly code: lexing and parsing it is very cheap compared to other languages, so there's not much harm in doing it twice.

If you try to craft a fancy regex pattern yourself, you'll just end up duplicating the first stages of the assembler anyway, only you'll have to debug it yourself - it's better to go with a complete and tested solution.
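To illustrate the "just invoke the assembler" idea, here is a minimal sketch in Python. It assumes NASM is installed and on the PATH and that the input uses NASM syntax; the function name assembles_ok and the file names are made up for the example, and any other assembler could be swapped in through the assembler_cmd parameter as long as it accepts "-o <output> <input>".

```python
import os
import subprocess
import tempfile


def assembles_ok(source, assembler_cmd=("nasm", "-f", "bin")):
    """Return True if the external assembler accepts `source`.

    Assumption: the command accepts "-o <output> <input>" and exits
    with status 0 on success, which is how NASM behaves.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "input.asm")   # hypothetical file names
        out_path = os.path.join(tmp, "output.bin")
        with open(src_path, "w") as f:
            f.write(source + "\n")
        result = subprocess.run(
            [*assembler_cmd, "-o", out_path, src_path],
            capture_output=True,
            text=True,
        )
        # A non-zero exit status means the assembler rejected the text;
        # result.stderr then holds its diagnostics.
        return result.returncode == 0
```

Calling assembles_ok("mov ah, 37") then asks the real assembler for a verdict instead of guessing, at the cost of depending on an external program.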

score 1 · Accepted Answer · answered Jun 02 '13 at 14:33

For a decently accurate identification, checking that the lines match a regex will be okay. That's actually very similar to the first step of a compiler, the scanning phase, where the contents of the file are read and the tokens identified. The next step, the actual parsing, is more complex (although not that complex for assembler).

An example of a regex would be something like this:

^[ \t]*((mov|xor|add|mul)[ \t]+([abcd][xhl]|[cd]s)[ \t]*,[ \t]*|jmp[ \t]+)([abcd][xhl]|[cd]s|[0-9A-F]+)[ \t]*$

It first checks the valid instructions with two parameters, then the existence of a parameter, followed by the alternative of single param instructions and then the existence of another parameter - including a numeric constant which is valid as the second param.
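If a single regex gets unwieldy, the same check can be split into a small scanner plus a lookup table. The following is only a sketch in Python under a few assumptions: the mnemonic table is a tiny illustrative subset rather than the real i8086 instruction set, operands are limited to registers and numeric constants, and the name looks_like_instruction is made up. It also tolerates a leading label and a trailing ; comment.

```python
import re

# 8086 general-purpose, index/pointer and segment registers.
REGISTERS = {
    "ax", "bx", "cx", "dx", "ah", "al", "bh", "bl", "ch", "cl", "dh", "dl",
    "si", "di", "bp", "sp", "cs", "ds", "es", "ss",
}

# Assumed illustrative subset: mnemonic -> number of operands expected.
OPERAND_COUNT = {"mov": 2, "xor": 2, "add": 2, "jmp": 1}

# Decimal constant, or hex constant with a leading digit and an h suffix.
NUMBER_RE = re.compile(r"[0-9]+|[0-9][0-9A-Fa-f]*[hH]")
LABEL_RE = re.compile(r"[A-Za-z_]\w*:")


def is_operand(token):
    """Accept a register name or a numeric constant."""
    return token.lower() in REGISTERS or NUMBER_RE.fullmatch(token) is not None


def looks_like_instruction(line):
    """Return True if `line` looks like '[label:] mnemonic operands'."""
    code = line.split(";", 1)[0].strip()          # drop a trailing comment
    label = LABEL_RE.match(code)
    if label:
        code = code[label.end():].strip()         # drop a leading label
    if not code:
        return label is not None                  # bare label is fine, empty line is not
    parts = code.split(None, 1)
    mnemonic = parts[0].lower()
    operands = [op.strip() for op in parts[1].split(",")] if len(parts) > 1 else []
    expected = OPERAND_COUNT.get(mnemonic)
    if expected is None or len(operands) != expected:
        return False
    return all(is_operand(op) for op in operands)
```

With this, looks_like_instruction("mov ah, 37") is accepted and looks_like_instruction("bad my 42") is rejected because "bad" is not in the table. Supporting more of the instruction set means growing the table and the operand rules (memory operands, labels as jump targets), not the regex.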

I can't imagine much assembly code where the first column is always blank. Labels are useful! — Ben Voigt, Jun 02 '13 at 14:38
You're right that the regex doesn't handle labels... it was just meant as a start - the real one is more complex. — Anders Abel, Jun 02 '13 at 14:41
