2

I have a following string:

'(1:A & ((2:B | 3:C) & (4:D | 5:E))) | (6:F & 7:G)'

and I want to parse them into tree hierarchy object. 1:A, 2:B, 3:C should be in leafs and & or | in roots. Parentheses are important, so string from example should be converted into:

            |
         /    \
        /      \
       &        &
      / \      / \
     /   \    /   \
   1:A    &  6:F  7:G
         / \
        /   \
       /     \      
      |      |
     / \    / \
  2:B 3:C  4:D 5:E

I tried to use patter which had split the brackets, but I could not convert the result in a reasonable hierarchy. I ran out of ideas, maybe some of you will have some?

ajuszczak
  • 25
  • 3

3 Answers3

1

Reverse Polish Notation

Relevant question on StackOverflow

Community
  • 1
  • 1
emesx
  • 12,555
  • 10
  • 58
  • 91
0

I think this does it (at least for your example input).

First, I replace all X:Y instances with Leaf instances, store the Leaf in a cache list and replace the X:Y with the index of the Leaf in the cache.

Then, I repeatedly look for (L O R) patterns, create a Node instance for it (looking up L and R in the cache), add it to the cache, and replace it with it's cache index in the string.

Then you are left with a String 11 | 9 (or similar), which can be converted into a Node as before.

Hope that explains it. Might be a more sensible solution to use a proper parser generator if this is for anything serious, or will need extending in the future...

import groovy.transform.*

def input = '(1:A & ((2:B | 3:C) & (4:D | 5:E))) | (6:F & 7:G)'

def cache = []

// Remove all X:Y items into a lookup table
input = input.replaceAll( /\w+:\w+/ ) {
  cache << new Leaf( it )
  cache.size() - 1
}

// Then keep finding Nodes filling them with cached values and adding them into the cache
while( input ==~ /.*\(\d+ [&|] \d+\).*/ ) {
    input = input.replaceAll( /\(\d+ [&|] \d+\)/ ) {
        def (left,op,right) = it[ 1..-2 ].tokenize()
        cache << new Node( cache[ left.toInteger() ], op, cache[ right.toInteger() ] )
        cache.size() - 1
    }
}

// Then we're left with "11 | 9", so make the root node
def (left,op,right) = input.tokenize()
def tree = new Node( cache[ left.toInteger() ], op, cache[ right.toInteger() ] )

println tree

///////////////////////////////////////////////////////
// Classes to hold leafs and nodes

@Canonical
class Leaf {
    String value
    String toString( String prefix ) {
        "${prefix}${value}"
    }
}

@TupleConstructor
class Node {
    def left, op, right
    String toString( String prefix ) {
        """${prefix}left:
          |${prefix}${left.toString( prefix + '    ' )}
          |${prefix}op: $op
          |${prefix}right:
          |${prefix}${right.toString( prefix + '    ' )}""".stripMargin()
    }
    String toString() {
        toString( '' )
    }
}
tim_yates
  • 167,322
  • 27
  • 342
  • 338
0

An alternative solution using regex.

The idea is to recursively parse the input string matching leaf nodes and expressions in paranthesis. Each invokation of the parse method will match it's entire input and either return a node with the input value (if it is a leaf) or a node with two sub-nodes that are evaluated recursively (if it matches a pattern of "expression operator expression").

In its current form it requires highly regular input:

  • a leaf must be on the form digit:upperCaseCharacter with no parenthesis
  • one space between operator and expressions
  • an expression is either a leaf or something contained in parenthesis

Whenever the regex matches, all matching groups are collected and null values are removed. There will be 8 non-null groups (since the number of grouping parenthesis are always the same) and the inner groups will always be number 2, 3 and 5, where 3 is the operator and 2 and 5 are the input for the left and right nodes respectively.

The toString method simply writes out the expression in polish notation (op ex1 ex2).

class Node {
    String value
    Node left, right

    static leaf = /(\d:[A-Z])/
    static op = /([&|])/
    static par = /\((.+)\)/

    static parse(String text) {
        if (text ==~ "^$leaf\$") {
            return new Node(value: text)
        }
        def matcher = text =~ "^($leaf|$par) $op ($leaf|$par)\$"
        if (matcher.matches()) {
            def (l,o,r) = (matcher[0] - null)[2,3,5]
            return new Node(left: parse(l), value: o, right: parse(r))
        }
        throw new IllegalArgumentException("Irregular input: $text")
    }

    String toString() {
        left && right ? "$value $left $right" : value
    }
}

Example:

println Node.parse('1:A')
println Node.parse('1:A & 2:B')
println Node.parse('(1:A & 2:B) | 3:C')
println Node.parse('(1:A & ((2:B | 3:C) & (4:D | 5:E))) | (6:F & 7:G)')

Produces the following polish notation output:

1:A
& 1:A 2:B
| & 1:A 2:B 3:C
| & 1:A & | 2:B 3:C | 4:D 5:E & 6:F 7:G
Steinar
  • 5,860
  • 1
  • 25
  • 23
  • Almost everything what I need. I tried to change leaf pattern without success. In my case there could be something some complex than in example. Let's say uuid and word after : ( In case of word I can probably use \w but I stopped on uuid). Whenever I tried to change the regex getting an exception. – ajuszczak Mar 13 '14 at 17:14
  • Can you provide an example leaf? If so, I can give it a go. – Steinar Mar 13 '14 at 17:48
  • 682be428-25af-4a96-b057-0a9f28f40bcf:Sport or 54bc9a00-d030-4e63-9ded-a8c52063802d:Travel ( UUID v4 and word ) – ajuszczak Mar 13 '14 at 18:24
  • So, instead of matching a digit you want to match many chars with values 0..9, a..f and the hyphen -. Instead of uppercase letter you can match any word with \w+. All in all, this should do the trick: `leaf = /([0-9a-f-]+:\w+)/` – Steinar Mar 13 '14 at 18:48