Parsing text structured as a tree with fixed-width columns using Parslet in Ruby

Question

I'm stuck. For a couple of days I've been trying to parse this text (see the bottom of the post), but I can't figure some things out. First, the text is formatted as a tree with fixed-width columns, but the exact column width depends on the widest field.

I'm using Ruby. I first tried the Treetop gem and made some progress, but then decided to try Parslet, which is what I'm using now. It seems like it should be easier, but it's hard to find detailed documentation for it.

Currently I parse each line individually and build an array of parsed entries, but that's not correct because I lose the structure. I need to parse it recursively and handle depth.

I would really appreciate any tips, ideas, or suggestions.

Here's my current code. It works, but all the data is flattened. My current idea is to parse recursively whenever the current line's start position (i.e. its width) is greater than the previous line's, since that means we should go one level deeper. I actually managed to make that work, but then I couldn't get back out to the outer level properly, so I removed that code.

require 'pp'
require 'parslet'
require 'parslet/convenience'


class TextParser < Parslet::Parser
    @@width = 5

    root :text

    rule(:text)   { (line >> newline).repeat }

    rule(:line) { left >> ( topline | subline ).as(:entry) }

    rule(:topline) {
        float.as(:number) >> str('%') >> space >> somestring.as(:string1) >> space >> specialstring.as(:string2) >> space >> specialstring.as(:string3)
    }

    rule(:subline) {
        dynamic { |source, context|
            width = context.captures[:width].to_s.length
            width = width-1 if context.captures[:width].to_s[-1] == '|'
            if width > @@width
                # should be recursive
                result = ( specialline | lastline | otherline | empty )
            else
                result = ( specialline | lastline | otherline | empty )
            end
            @@width = width
            result
        }
    }

    rule(:otherline) {
        somestring.as(:string1)
    }

    rule(:specialline) {
        float.as(:number) >> str('%') >> dash >> space? >> specialstring.as(:string1)
    }

    rule(:lastline) {
        float.as(:number) >> str('%') >> dash >> space? >> str('[...]')
    }

    rule(:empty) {
        space?
    }

    rule(:left) {  seperator.capture(:width) >> dash?.capture(:dash) >> space? }

    rule(:somestring) { match['0-9A-Za-z\.\-'].repeat(1) }
    rule(:specialstring) { match['0-9A-Za-z&()*,\.:<>_~'].repeat(1) }

    rule(:space) { match('[ \t]').repeat(1) }
    rule(:space?) { space.maybe }
    rule(:newline) { space? >> match('[\r\n]').repeat(1) }

    rule(:seperator) { space >> (str('|') >> space?).repeat }
    rule(:dash) { space? >> str('-').repeat(1) }
    rule(:dash?) { dash.maybe }

    rule(:float)   { (digits >> str('.') >> digits) }
    rule(:digits)   { match['0-9'].repeat(1) }

end

parser = TextParser.new

# Read the whole file in binary mode; no file handle is left open.
contents = File.binread("text.txt")

pp parser.parse_with_debug(contents)

The text looks like this (https://gist.github.com/davispuh/4726538):

 1.23%  somestring  specialstring                    specialstring
        |
        --- specialstring
           |          
           |--12.34%-- specialstring
           |          specialstring
           |          |          
           |          |--12.34%-- specialstring
           |          |          specialstring
           |          |          |          
           |          |          |--12.34%-- specialstring
           |          |           --1.12%-- [...]
           |          |          
           |           --2.23%-- specialstring
           |                     |          
           |                     |--12.34%-- specialstring
           |                     |          specialstring
           |                     |          specialstring
           |                     |          |          
           |                     |          |--12.34%-- specialstring
           |                     |          |          specialstring
           |                     |          |          specialstring
           |                     |           --1.23%-- [...]
           |                     |          
           |                      --1.23%-- [...]
           |                                 
            --1.05%-- [...]

 1.23%  somestring  specialstring                    specialstring
 2.34%  somestring  specialstring                    specialstring  
        |
        --- specialstring
            specialstring
            specialstring
           |          
           |--23.34%-- specialstring
           |          specialstring
           |          specialstring
            --34.56%-- [...]

        |
        --- specialstring
            specialstring
           |          
           |--12.34%-- specialstring
           |          |          
           |          |--100.00%-- specialstring
           |          |          specialstring
           |           --0.00%-- [...]
            --23.34%-- [...]

thanks :)

What is generating that text, and does it have alternate ways of outputting the data? The output you show, and are trying to parse, isn't really designed for parsing. Instead it is to help visualize something. That's great for human eyes, but stinks for data-reuse. — the Tin Man, Feb 07 '13 at 06:25
for what downvote? I don't see anything bad with this question. — davispuh, Feb 07 '13 at 06:26
exactly, that's why I need to parse it and sadly but I can't get any other output :( — davispuh, Feb 07 '13 at 06:30
You still didn't say what generates that file. Also, where does it get the data? — the Tin Man, Feb 07 '13 at 06:53
Dont think it matters if I'm interested in parsing but ok it's from program performance analyses generated by `perf` — davispuh, Feb 07 '13 at 08:17
Looking through the "[perf tutorial](https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report)", it looks like there are other, more readable ways to output the data, in particular "[Machine Readable Output](https://perf.wiki.kernel.org/index.php/Tutorial#Machine_readable_output)". I haven't played with it, but it seems like it'd be a lot easier path. — the Tin Man, Feb 07 '13 at 14:47

score 2 · Accepted Answer · edited May 23 '17 at 11:48

I was going to say the same thing as "the Tin Man". There has to be another format you can generate the data in.

If you want to parse this anyway... Parslet works like a map/reduce algorithm. Your first pass (parsing) is not intended to give you your final output, just to capture all the information you need from your source document.

Once you have that stored in a tree, you can then transform it to get the output you want.

So... I would write a parser that records each whitespace character as a node, as well as matching the text and percentages you need. I would group the whitespace nodes in an "indentation" node.

I would then use a transform to replace the whitespace nodes with a count of those nodes, which gives you the indentation of each line.
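As a rough illustration of the "count the indentation" step, here is a minimal sketch in plain Ruby rather than an actual Parslet transform. The helper name `split_indent` and the returned hash shape are assumptions made for this example, and it assumes the leading run of spaces and `|` characters is what marks the depth in the sample text.

```ruby
# Hypothetical helper (not part of Parslet): split one line into the width
# of its leading run of spaces and '|' characters and the remaining text.
def split_indent(line)
  prefix = line[/\A[ |]*/] # always matches, possibly as an empty string
  { indent: prefix.length, text: line[prefix.length..-1].rstrip }
end

# Both sibling markers end up with the same indent width.
first = split_indent((' ' * 11) + '|--12.34%-- specialstring')
last  = split_indent((' ' * 12) + '--1.05%-- [...]')
puts first[:indent] == last[:indent]
```

Lines that contain only separators come back with an empty `:text`, so they can be filtered out before any further processing.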

Remember: Parslet generates a standard Ruby hash. You can then write whatever code you like to make sense of this tree.

The parser is just converting the text file into a data structure you can manipulate.

Just to reiterate, though: I think "the Tin Man" has the right answer. Generate the data in a machine-readable way instead.
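To make "write whatever code you like" a bit more concrete, here is a minimal sketch of turning a flat list of entries back into a nested tree with a stack. The `Node` struct, the `build_tree` name, and the `{ indent:, text: }` entry shape are all assumptions for illustration; real Parslet output would need to be mapped into that shape first.

```ruby
# Hypothetical node type: indentation width, text, and nested children.
Node = Struct.new(:indent, :text, :children)

# Rebuild nesting from a flat list of { indent:, text: } hashes.
# An entry becomes a child of the nearest earlier entry with a smaller indent.
def build_tree(entries)
  root = Node.new(-1, nil, []) # sentinel; shallower than any real entry
  stack = [root]
  entries.each do |entry|
    node = Node.new(entry[:indent], entry[:text], [])
    # Leave deeper or equal levels until the top of the stack is a parent.
    stack.pop while stack.last.indent >= node.indent
    stack.last.children << node
    stack.push(node)
  end
  root.children
end

entries = [
  { indent: 12, text: 'a' },
  { indent: 23, text: 'b' },
  { indent: 23, text: 'c' },
  { indent: 12, text: 'd' }
]
tree = build_tree(entries)
puts tree.map(&:text).inspect
puts tree.first.children.map(&:text).inspect
```

Because the stack only compares indent widths against each other, this approach does not need to know the column width in advance, which matters when the width depends on the widest field.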

Update:

For an alternative approach you can check out: Indentation sensitive parser using Parslet in Ruby?
