4

As the title says, I'm trying to parse for example

term(A, b, c(d, "e", 7))

in a Lua table like

{term, {A, b, {c, {d, "e", 7}}}}

This is the grammar I built:

local pattern = re.compile[=[
  term      <- variable / function
  argument  <- variable / lowercase /number / string
  function  <- {|lowercase {|(open argument (separator (argument / function))* close)?|}|}
  variable  <- uppercase
  lowercase <- {[a-z][A-Za-z0-9]*}
  uppercase <- {[A-Z][A-Za-z0-9]*}
  string    <- '"' {~ [^"]* ~} '"'
  number    <- {[0-9]+}
  close     <- blank ")"
  open      <- "(" blank
  separator <- blank "," blank
  blank     <- " "*
]=]

I'm having the following problems:

  • It can't parse nested terms. For the example above it returns only {term, {} } (while it's ok with term(A, b, c)).
  • To strip the quotes from the strings I used {~ ~}, but because of that I had to move all the captures from argument and term in the rows below. Is there a way to avoid this?
  • I'd like to have a key associated with each element to specify its type, for example instead of A something like {value = "A", type = "variable"}. I found a way to do this with {:name: :} but, the order of the elements in the table is lost (because it doesn't create a new table but simply adds a key, in this case variable="A" and the order of this elements is not fixed). How can I tag the items maintaining the order?
キキジキ
  • 1,443
  • 1
  • 25
  • 44

2 Answers2

6

In your grammar you have:

argument  <- variable / lowercase /number / string
function  <- {|lowercase {|(open argument (separator (argument / function))* close)?|}|}

Keep in mind that lpeg tries to match the patterns/predicates in the rule in the order you have it. Once it finds a match lpeg won't consider further possible matches in that grammar rule even if there could be a "better" match later on.

Here it fails to match nested function calls because it sees that c can match

`argument  <- variable`

Since your variable non-terminal is listed before function, lpeg doesn't consider the latter and so it stops parsing the tokens that comes after.

As an experiment, I've modified your grammar slightly and added some table&named captures for most of the non-terminals you're interested in.

local pattern = re.compile
[=[
  term      <- {| {:type: '' -> "term" :} term_t |}
  term_t    <- func / var
  func      <- {| {:type: '' -> "func":} {:name: func_id:} "(" arg(separator arg)* ")" |}
  func_id   <- lower / upper
  arg       <- number / string / term_t
  var       <- {| {:type: '' -> "var" :} {:name: lower / upper:} |}
  string    <- '"' {~ [^"]* ~} '"'
  lower <- {%l%w*}
  upper <- {%u%w*}
  number    <- {%d+}
  separator <- blank "," blank
  blank     <- " "*
]=]

With a quick pattern test:

local test = [[fun(A, b, c(d(42), "e", f, 7))]]
dump( pattern:match(test) )

Which gives the following output on my machine:

{
  {
    {
      type = "var",
      name = "A"
    },
    {
      type = "var",
      name = "b"
    },
    {
      {
        "42",
        type = "func",
        name = "d"
      },
      "e",
      {
        type = "var",
        name = "f"
      },
      "7",
      type = "func",
      name = "c"
    },
    type = "func",
    name = "fun"
  },
  type = "term"
}

Looking carefully at the above, you'll notice that the function arguments appear in the index part of the table in the order that they were passed in. OTOH the type and name can appear in any order since it's in the associative part of the table. You can wrap those "attributes" in another table and put that inner attribute table in the index part of the outer table.

Edit: Here's a revised grammar to make the parse a bit more uniform. I've removed the term capture to help prune some unnecessary branches.

local pattern2 = re.compile
[=[
  term      <- term_t
  term_t    <- func / var
  func      <- {| {:type: '' -> "func":} {:name: func_id:} "(" args? ")" |}
  func_id   <- lower / upper
  arg       <- number / string / term_t
  args      <- arg (separator args)?
  var       <- {| {:type: '' -> "var" :} {:name: lower / upper:} |}
  string    <- {| {:type: '' -> "string" :}'"' {:value: [^"]* :} '"' |}
  lower     <- {%l%w*}
  upper     <- {%u%w*}
  number    <- {| {:type: '' -> "number":} {:value: %d+:} |}
  separator <- blank "," blank
  blank     <- " "*
]=]

Which yields the following:

{
  {
    type = "var",
    name = "A"
  },
  {
    type = "var",
    name = "b"
  },
  {
    {
      {
        type = "number",
        value = "42"
      },
      type = "func",
      name = "d"
    },
    {
      type = "string",
      value = "e"
    },
    {
      type = "var",
      name = "f"
    },
    {
      type = "number",
      value = "7"
    },
    type = "func",
    name = "c"
  },
  type = "func",
  name = "fun"
}
greatwolf
  • 20,287
  • 13
  • 71
  • 105
  • Thank you very much! Pegs have been a bit confusing for me. I will adopt your grammar and continue from there. Is there a way to obtain something like this -> http://pastebin.com/m3udvahC ? – キキジキ Jul 27 '13 at 05:27
  • @キキジキ you mean giving the `term` branch a name? That's probably possible by tweaking the grammar somehow. Note that the way the grammar's currently defined, `fun` is parsed as a function and so it's a child of `term`. The `term` itself doesn't actually have a name here. What name should `term` take on? Should it steal its first child's name? What if that child doesn't have a name? – greatwolf Jul 27 '13 at 05:32
  • @キキジキ Another idea is to just cut the `term` capture out altogether. In this case, all you would have left in the AST are functions, variables and other primitive terminals like number, strings etc. This is probably ok since it doesn't look like `term` is really adding any other information in the example. – greatwolf Jul 27 '13 at 05:39
  • I'd like to have more uniform elements. In "c", "e" and "7" appears like that but d and f are tables. What I want to do is that each element is represented the same. For example since "42" is an argument of "d", there should be a table of d with its name and type, and then a nested table with a list of its arguments like {d, {42}}. With type and name labels, it should be {name="d", type="fun", {{name="42", type="num"}}}. It's ok also if type and name are at specified indexes, like {fun, d{{num, 42}}}. About term, I happened to call it "term" but it can be something else. – キキジキ Jul 27 '13 at 05:48
  • This is slightly different from the one in pastebin because I forgot to wrap arguments in a table there. – キキジキ Jul 27 '13 at 05:50
  • @キキジキ I've added a grammar revision. See if that fits. – greatwolf Jul 27 '13 at 06:13
  • 1
    Great! Thank you for taking time to answer my question. – キキジキ Jul 27 '13 at 06:29
2

Sorry, I didn't have experience with LPeg, but usual Lua patterns are enough to easily solve your task:

local str = 'term(A, b, c(d, "e", 7))'

local function convert(expr)
    return (expr:gsub('(%w+)(%b())',
        function (name, par_expr)
            return '{'..name..', {'..convert(par_expr:sub(2, -2))..'}}'
        end
    ))
end

print(convert(str))  -- {term, {A, b, {c, {d, "e", 7}}}}

Now just load() converted string to create a table.

Egor Skriptunoff
  • 23,359
  • 2
  • 34
  • 64
  • Well, you're right it's a very nice solution! If I find a way to make pegs work I can obtain more data (like type of element) and eventually expand it to add operators, but if it's too complicated probably I'll go with your solution. – キキジキ Jul 27 '13 at 05:30