I found a file containing an XML-like structure. I have no idea what the format is, but I need to parse it.

It looks like this:

<:openingtag>
    <:openingtag2>
        <:openingtag3>
            text    
        </>
    </>
</>

Does anybody have an idea how to parse it? Groovy, Java, or Python would all be fine for implementing the parser.

Federico klez Culloca
  • Can you share at least part of the file? – AMC Feb 06 '20 at 07:35
  • Please read https://stackoverflow.com/help/how-to-ask – KGS Feb 06 '20 at 07:38
  • The XML is not well-formed. No matter which language you use, it won't parse. I don't know of any libraries for parsing XML with permissive/loose format checking. Refer to https://stackoverflow.com/questions/44765194/how-to-parse-invalid-bad-not-well-formed-xml – ou_ryperd Feb 06 '20 at 08:21
  • To tell you how to parse it, we would have to know the specification of the data structure, first. Writing the parser itself can then be literally done irrespective of the language you use. – Jan Held Feb 06 '20 at 08:33
  • This would probably be a good start https://www.javaworld.com/article/2077493/java-tip-128--create-a-quick-and-dirty-xml-parser.html but would need some work (probably a stack of last seen names) around closing tags being nameless – tim_yates Feb 06 '20 at 10:01
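
The "stack of last seen names" idea from the last comment can be sketched by hand, without a parser library. The following is a minimal sketch in Python (one of the languages the question allows), not a definitive implementation. It assumes the grammar is exactly what the sample shows: `<:name>` opens an element, a nameless `</>` closes the most recently opened one, and anything else is text. The function name `parse` and the node layout (`name`, `text`, `children`) are hypothetical choices made for illustration.

```python
import re

# Token shapes inferred only from the sample in the question (an assumption):
#   <:name>  opens an element
#   </>      closes the most recently opened element
#   other    runs of characters without '<' are text
TOKEN = re.compile(r"<:(?P<open>[^>]+)>|(?P<close></>)|(?P<text>[^<]+)")


def parse(source):
    """Return the top-level nodes as {'name', 'text', 'children'} dicts."""
    root = {"name": None, "text": "", "children": []}
    stack = [root]  # stack[-1] is the innermost element that is still open
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        if match.group("open") is not None:
            node = {"name": match.group("open"), "text": "", "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif match.group("close") is not None:
            # The closing tag carries no name, so the stack decides what it closes.
            if len(stack) == 1:
                raise ValueError("closing tag without a matching opening tag")
            stack.pop()
        else:
            text = match.group("text").strip()
            if text:
                current = stack[-1]
                current["text"] = (current["text"] + " " + text).strip()
    if len(stack) != 1:
        raise ValueError(f"{len(stack) - 1} element(s) left unclosed")
    return root["children"]


SAMPLE = """
<:openingtag>
    <:openingtag2>
        <:openingtag3>
            text
        </>
    </>
</>
"""

tree = parse(SAMPLE)
print(tree[0]["children"][0]["children"][0]["text"])  # prints: text
```

Because the closing tag is nameless, mismatched nesting cannot be detected by name; the sketch only checks that opening and closing tags balance. Text found outside any element is collected on the internal root node and not returned.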

1 Answer

A naive parser using PetitParser. It leaves a lot uncovered, of course, since the grammar is unknown.

@Grab("com.github.petitparser:petitparser-core:2.2.0")
import org.petitparser.tools.GrammarDefinition
import org.petitparser.tools.GrammarParser
import org.petitparser.parser.primitive.CharacterParser as CP
import org.petitparser.parser.primitive.StringParser as SP
import org.petitparser.utils.Functions as F

class FakeMLGrammerDefinition extends GrammarDefinition {
    FakeMLGrammerDefinition() {
        define("start",
                ref("tag").trim())
        define("tag",
                ref("tag-start")
                .seq(ref("tag").star())
                .seq(ref("text").optional())
                .seq(ref("tag").star())
                .seq(ref("tag-end")))
        define("tag-start",
                SP.of('<:')
                .seq(ref("keyword"))
                .seq(SP.of(">"))
                .trim())
        define("tag-end",
                SP.of("</>")
                .trim())
        define("text",
                CP.pattern("^<").star().flatten().trim())
        define("keyword",
                CP.letter()
                .seq(CP.pattern("^>").plus())
                .star()
                .flatten())
    }

    /** Helper for `def`, which is a keyword in groovy */
    void define(s, p) { super.def(s,p) }
}

class FakeMLParserDefinition extends FakeMLGrammerDefinition {
    FakeMLParserDefinition() {
        action("tag", { tag, c1, t, c2, _ -> 
                [(tag): [children: c1+c2, text: t]]
        })
        action("tag-start", { it[1] })
    }
}

class FakeMLParser extends GrammarParser {
    FakeMLParser() {
        super(new FakeMLParserDefinition())
    }
}

println(new FakeMLParser().parse("""
<:openingtag>
    <:openingtag2>
        <:openingtag3>
            text
        </>
    </>
</>
"""))
// Success[9:1]: {openingtag={children=[{openingtag2={children=[{openingtag3={children=[], text=text}}], text=}}], text=}}
cfrick
  • I'm impressed. This really works. :-) Thanks! But that brings me to the question of how to access the data in the parsing result. Could you please give me an additional hint on how to iterate through the structure? Thanks! – Henning Feb 06 '20 at 13:37
  • You can call `get()` on the result of `parse` to get the nested maps. From there it's just basic Groovy, e.g. `result.get().openingtag.children.openingtag2.children.openingtag3.text` (minus some nesting, which is wrong in that expression). Since you are looking for a Groovy answer, I assume you are familiar with Groovy. – cfrick Feb 06 '20 at 19:19
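
The parse result is plain nested maps and lists, so iterating over it is an ordinary recursive walk. The sketch below shows the idea in Python rather than Groovy, purely as an illustration. The `result` literal is a hand transcription of the output printed in the answer, and the helper name `walk` is hypothetical.

```python
# Hand transcription of the result printed in the answer into Python literals.
# Each node is a single-entry map: {tag_name: {"children": [...], "text": ...}}.
result = {
    "openingtag": {
        "children": [
            {
                "openingtag2": {
                    "children": [
                        {"openingtag3": {"children": [], "text": "text"}}
                    ],
                    "text": "",
                }
            }
        ],
        "text": "",
    }
}


def walk(node, path=()):
    """Yield (path, text) for every tag, depth first."""
    for name, body in node.items():
        here = path + (name,)
        yield here, body["text"]
        # "children" is a list, so each child must be visited individually.
        for child in body["children"]:
            yield from walk(child, here)


for path, text in walk(result):
    print("/".join(path), repr(text))
```

Note that `children` is a list, which is why a direct chain of property names needs an index (or a loop) at every level.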