12

I have essentially the same question as PEG for Python style indentation, but I'd like to get a little more direction regarding this answer.

The answer successfully generates an array of strings that are each line of input with 'INDENT' and 'DEDENT' between lines. It seems like he's pretty much used PEG.js to tokenize, but no real parsing is happening.

So how can I extend his example to do some actual parsing?

As an example, how can I change this grammar:

start = obj
obj = id:id children:(indent obj* outdent)?
    {
        if (children) {
            let o = {}; o[id] = children[1];
            return o;
        } else {
            return id;
        }
    }
id = [a-z]
indent = '{'
outdent = '}'

to use indentation instead of braces to delineate blocks, and still get the same output?

(Use http://pegjs.majda.cz/online to test that grammar with the following input: a{bcd{zyx{}}})

Nils Lindemann
  • 1,146
  • 1
  • 16
  • 26
Trevor Dixon
  • 23,216
  • 12
  • 72
  • 109

2 Answers2

21

Parser:

// do not use result cache, nor line and column tracking

{ var indentStack = [], indent = ""; }

start
  = INDENT? l:line
    { return l; }

line
  = SAMEDENT line:(!EOL c:. { return c; })+ EOL?
    children:( INDENT c:line* DEDENT { return c; })?
    { var o = {}; o[line] = children; return children ? o : line.join(""); }

EOL
  = "\r\n" / "\n" / "\r"

SAMEDENT
  = i:[ \t]* &{ return i.join("") === indent; }

INDENT
  = &(i:[ \t]+ &{ return i.length > indent.length; }
      { indentStack.push(indent); indent = i.join(""); pos = offset; })

DEDENT
  = { indent = indentStack.pop(); }

Input:

a
  b
  c
  d
    z
    y
    x

Output:

{
   "a": [
      "b",
      "c",
      {
         "d": [
            "z",
            "y",
            "x"
         ]
      }
   ]
}

It cannot parse an empty object (last x), however, it should be easy to solve. Trick here is the SAMEDENT rule, it succeeds when indentation level hasn't changed. INDENT and DEDENT change current indentation level without changing position in text pos = offset.

Boaz Yaniv
  • 6,334
  • 21
  • 30
Jakub Kulhan
  • 1,572
  • 2
  • 16
  • 37
  • I really appreciate this anwser. How do you come up with this method please? – jiyinyiyong Dec 07 '12 at 08:40
  • If I copy/paste this directly into http://pegjs.majda.cz/online it does not compile. And after some tweaking it is not clear how to"fix" it. – Clearly Feb 15 '13 at 22:53
  • Just tested this, the snippet compiles and produce the expected output just fine. Not sure what errors you hit there. – chakrit Apr 13 '13 at 16:32
  • 1
    How does this handle a backtracking case? If you have a set of precendence rules with multiple possible subexpressions, wouldn't indent.stack be screwed up by a series of legal matches within the first subexpression, followed by chaos as the second subexpression is parsed? – Elf Sternberg May 03 '13 at 15:11
  • I tried this in http://pegjs.majda.cz/online (pegjs v 0.8) where it errors, but in http://peg.arcanis.fr/1PvOax/ (pegjs v 0.7) it works. – joeriks Jan 14 '14 at 05:44
  • 1
    @joeriks: I've fixed to work on v0.8.0, if it's still helpful. – Boaz Yaniv Mar 03 '14 at 13:53
5

Update 2021

Here is a working example which runs in the online playground of Peggy.js. Peggy.js is a fork of PEG.js under active development. PEG.js was discontinued by David Maida.

The example shows, how the INDENT, SAMEDENT and DEDENT rules are parsed, and how to use parsing locations. Check the console log.

It uses these syntaxes, which may not be known from other parser generators:

(top of file)

  • {{...}} (Global initializer) – Run ... on parser generation.
  • {...} (Per-parse initializer) – Run ... on parser instantiation.

(in-file)

  • X {...} (action) – Do ... when X succeeds. Variables from the initializers are available. If ... returns something, it will replace what X returns.
  • $X – Return the raw text parsed with X, instead of the result of X.
  • ... @X ... (pluck operator) – Replace the result of ... X ... with the result of X.
  • X &{...} (predicate) – "and ... also needs to be true for X to succeed".
  • X = &(...) – If ... succeeds, X succeeds. ... consumes no input.

See the docs for more information.

{{
    console.clear()
    console.log('Parser generated')
}}

{
    let indentstack = []
    let indent = ''
    function found (what) {
        let loc = location()
        console.log(`[${loc.start.line}:${loc.start.column} - ${loc.end.line}:${loc.end.column}] found ${what}`)
    }
    console.log('Parser instantiated')
}

DOCUMENT = NEWLINES? @THINGS NEWLINES? _

THINGS = ( SAMEDENT @( OBJECT / LINE ) )*

OBJECT = key:KEY childs:(BLOCK / INLINE) {
    found(`object "${key}"`)
    let o = {}
    o[key] = childs
    return o
}

KEY = @$( [^ \t\r\n:]+ ) _ ':' _

BLOCK = NEWLINES INDENT @THINGS DEDENT

INLINE = line:LINE { return [line] }

LINE = text:$( (!EOL .)+ ) NEWLINES? {
    found(`line "${text}"`)
    return text
}

INDENT = &(
    spaces:$( [ \t]+ ) &{
        return spaces.length > indent.length
    } {
        indentstack.push(indent)
        indent = spaces
    }
) {
    found('indent')
}

SAMEDENT = spaces:$( [ \t]* ) &{
    return spaces === indent
} {
    found('samedent')
}

/* Because of this rule, results cache must be disabled */
DEDENT = &{
    indent = indentstack.pop()
    return true
} {
    found('dedent')
}

_ = [ \t]*
EOL = '\r\n' / '\n' / '\r'
NEWLINES = (_ EOL)+

/* Test with this input

H:
  a
  b
  c
  G:
    d
    e
    f

*/

Old Answer

Here is a fix for @Jakub Kulhan´s grammar which works in PEG.js v 0.10.0. The last line needs to be changed to = &{ indent = indentStack.pop(); return true;} because PEG.js now does not allow standalone actions ({...}) in a grammar anymore. This line is now a predicate (&{...}) which always succeeds (return true;).

i also removed the pos = offset; because it gives an error offset is not defined. Probably Jakub was referring to some global variable available in older versions of PEG.js. PEG.js now provides the location() function which returns an object which contains offset and other information.

// do not use result cache, nor line and column tracking

{ var indentStack = [], indent = ""; }

start
  = INDENT? l:line
    { return l; }

line
  = SAMEDENT line:(!EOL c:. { return c; })+ EOL?
    children:( INDENT c:line* DEDENT { return c; })?
    { var o = {}; o[line] = children; return children ? o : line.join(""); }

EOL
  = "\r\n" / "\n" / "\r"

SAMEDENT
  = i:[ \t]* &{ return i.join("") === indent; }

INDENT
  = &(i:[ \t]+ &{ return i.length > indent.length; }
      { indentStack.push(indent); indent = i.join(""); })

DEDENT
  = &{ indent = indentStack.pop(); return true;}

Starting with v 0.11.0 PEG.js also supports the Value Plucking operator, @ which would allow writing this grammar even simpler, but as it is currently not in the online parser I will refrain from adding it to this example.

Nils Lindemann
  • 1,146
  • 1
  • 16
  • 26