1

I'm currently writing my own parser for a fictional Assembly language. The instructions are very similar to any normal assembly instruction:

[INSTRUCTION] [OP]*

where op can be 0-3 operands. I want to be able to use an expression that matches this. This is being written in C++ with boost::regex. I myself am a regexp noobie, trying to understand the boost documentation of what each symbol does.

Now, I already have an expression that can match 0-3 operands like so:

Sample Instructions:
    MOVI 8 10
    ADDI 8 8 10
    NOP
    BNEZI -1

Expression: ^([a-z]+)( ([-,0-9]+))*

However, I can't create a suitable expression that handles the same instructions when comma-delimited:

Sample Instructions:
    MOVI 8, 10
    ADDI 8, 8, 10

This is really tripping me up. I tried rewriting my expression like so:

^([a-z]+)( ([-,0-9]+))*(, ([-,0-9]+))*

This looks to be extremely green, poor regexp. It also isn't working correctly. I was thinking of using a recursive expression, but I looked at the documentation and I might as well scribble "overkill" on my forehead.

I realize I could just format the line to take out all the commas, but I would rather like to be able to write and understand a regexp expression first, then do it the easy way. Any help would be appreciated.

IAE
  • 2,213
  • 13
  • 37
  • 71
  • 1
    I'd actually recommend *not* using regular expressions for this. Afterall, what happens when you want to allow identifiers for memory locations, and registers? Instead dive into a [lexer generator of some kind](http://stackoverflow.com/search?q=c%2B%2B+lexer), and while you're at it consider using a [parser generator](http://stackoverflow.com/search?q=c%2B%2B+parser+generator). See also [general compiler methods](http://stackoverflow.com/q/1669/2509). – dmckee --- ex-moderator kitten Apr 16 '11 at 18:52
  • @dmckee: I was thinking of using Boost.Spirit to create a parser, but I considered that overkill. Although we have labels and identifiers, I figured the syntax was overall not complex enough to warrant something like that. I'll be sure to look into it though! Maybe it is the better solution. – IAE Apr 16 '11 at 18:55
  • I know the feeling, I avoiding using those tools for a long time. And the first one *will* be a little bit of a struggle, but you'll be happy you did it. – dmckee --- ex-moderator kitten Apr 16 '11 at 19:00
  • Can the delemiter between the numbers be space or comma, or one or the other within the group? –  Apr 16 '11 at 19:01
  • @sln: The instructions are comma delimited after the first value. So `ADD 10, 8, 8`. Is this what you meant? They are not space-delimited. I made a practice expression and actually forgot I had to check for commas. That's why the first version does not use commas; I just forgot! xD – IAE Apr 16 '11 at 19:04

5 Answers5

4

A string like:

ADDI 8, 8, 10

can be matched by a regex like this:

[a-zA-Z]+[ \t]+-?[0-9]+([ \t]*,[ \t]*-?[0-9]+)*

A (short) explanation:

[a-zA-Z]+   # match an instruction
[ \t]+      # match one or more spaces or tabs
-?[0-9]+    # match an integer value with an option minus sign in front of it
(           # open group 1
  [ \t]*    #   match zero or more spaces or tabs
  ,         #   match a comma
  [ \t]*    #   match zero or more spaces or tabs
  -?[0-9]+  #   match an integer value with an option minus sign in front of it
)*          # close group 1, and repeat it zero or more times

Having said all that, I must agree with dmckee's comment: a proper parser is the way to go, even if this is just a fictional language you're parsing.

Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • Same as I came up with (although mine had the comma suffixed instead of prefixed) :) – Jon Grant Apr 16 '11 at 18:54
  • Thanks for the explanation! I would have been searching through the documentation quite a bit to understand that bit of expression :) – IAE Apr 16 '11 at 18:56
1

try

^([a-z]+)( ([-,0-9]+)((,|\s)[-,0-9]+)*)*
gouki
  • 4,382
  • 3
  • 20
  • 20
1

There is a bison grammar as part of HLA assembler source, which you can get here

http://webster.cs.ucr.edu/AsmTools/HLA/frozen.html

I suggest using a proper grammar :)

sehe
  • 374,641
  • 47
  • 450
  • 633
  • I will one up this, but keep in mind that this is, strictly speaking, not what the question had asked and can't truly be an answer :) – IAE Apr 16 '11 at 19:08
  • I don't mind. Sometimes (more often than you'd think) the question asked is not really what the person is looking for :) – sehe Apr 16 '11 at 19:09
1

If your serious about doing this, you have to account for every single line, all boundry conditions, errors, etc. Its not enough just to match a certian form, while not knowing anything about other forms.

If using regex, a cleaner way is to reserve a fixed number of buffers that can be analyzed. This is fairly complex for a rudimentary first level syntax checker.

This is a start:

/
  ^
     (?!\s*$)
     \s* 
     (?|
         ([a-zA-Z]+)
         \s*
         ((?<=\s)
          -?\d+|) (?!,\s*$) (?:,\s*|\s*$)  
         (-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
         (-?\d+|) \s*$
         ()
       |
         ()()()()(.+)
     )
  $
/x;

A perl test case, using newline delimeted lines:

use strict;
use warnings;

my $rx = qr/
  ^
     (?!\s*$)
     \s* 
     (?|
         ([a-zA-Z]+)
         \s*
         ((?<=\s)
          -?\d+|) (?!,\s*$) (?:,\s*|\s*$)  
         (-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
         (-?\d+|) \s*$
         ()
       |
         ()()()()(.+)
     )
  $/x;

my $cnt = 0;  # line counter

while ( my $line = <DATA> )
{
    ++$cnt;

    if ( $line =~ /$rx/ )
    {
        if (length $5) {
           print "\nSyntax error ? (line $cnt)   '$5'\n";
        }
        else {
           print "\nInstruction:  '$1'\n";
           print "     op1 = '$2'\n";
           print "     op2 = '$3'\n";
           print "     op3 = '$4'\n";
        }
    }
}

__DATA__

    MOVI 8 10
    ADDI 8 8 10

    NOP
    BNEZI -1
    InstA  0
    InstB  1,
    InstC  2,3, 4
    InstD  5,6, 7, 8

    MOVI 7, 8
    ADDI 9, 10, 11
    ADDI 12, 13 14

Output:

Syntax error ? (line 2)   'MOVI 8 10'

Syntax error ? (line 3)   'ADDI 8 8 10'

Instruction:  'NOP'
     op1 = ''
     op2 = ''
     op3 = ''

Instruction:  'BNEZI'
     op1 = '-1'
     op2 = ''
     op3 = ''

Instruction:  'InstA'
     op1 = '0'
     op2 = ''
     op3 = ''

Syntax error ? (line 8)   'InstB  1,'

Instruction:  'InstC'
     op1 = '2'
     op2 = '3'
     op3 = '4'

Syntax error ? (line 10)   'InstD  5,6, 7, 8'

Instruction:  'MOVI'
     op1 = '7'
     op2 = '8'
     op3 = ''

Instruction:  'ADDI'
     op1 = '9'
     op2 = '10'
     op3 = '11'

Syntax error ? (line 14)   'ADDI 12, 13 14'

Alternative, all the data is in a string (same regex).

while ( $str =~ /$rx/mg ) { }

  • I will need a while to understand the expression, but thanks for the test cases and tipps. – IAE Apr 16 '11 at 22:39
0

you ca try this :

^[\t ]*(?:([.A-Za-z0-9_] )[:])?(?:[\t ]*([A-Za-z]{2,4})(?:[\t ] (\[([A-Za-z0-9_] (([- ])[0-9] )?)\]|\". ?\"|\'. ?\'|[.A-Za-z0-9_] )(?:[\t ]*[,][\t ]*(\[([A-Za-z0-9_] (([- ])[0-9] )?)\]|\". ?\"|\'. ?\'|[.A-Za-z0-9_] ))?)?)?

can match label, OP, params

Jeremy Talus
  • 125
  • 2
  • 11