Using HTML Parser with SGML

Question

I wanted to use an XML parser with an SGML document, but this doesn't work. After reading some suggestions, it seems the only way around this is to use an HTML parser. So I'm basically just trying to write a simple query that extracts the story title from my document. (It works if I pass null to parse, which prints the whole document; I'm just not sure how to access a specific tag, e.g. title.)

public static void main(String[] args){
    Parser parser = new Parser(xmlFile.getAbsolutePath());
    NodeList list = parser.parse (new HasAttributeFilter ("id","title"));
    Node node = list.elementAt(0);

    if (node instanceof TagNode) {
       TagNode meta = (TagNode) node;
       String description = meta.getAttribute("title");
       System.out.println(description);
    }
}

Start of SGML file:

<head>
<meta words=61 rate=180>
<formname>Testing</formname>
<storyid>1234</storyid>
</head>
<story>
<fields>
<f id=title>Sports</f>
<f id=modify-by>Tester</f>
<f id=modify-date>315576000</f>
</fields>
<body>

XML and HTML are both related to SGML, but they are not compatible with each other. Why not use an SGML parser? http://stackoverflow.com/questions/4867894/sgml-parser-in-java - Philipp, Feb 14 '13 at 14:59
I read that thread earlier and there was no definitive answer on how to use an SGML parser; if you can suggest one, great. All the suggestions seemed to lead to an HTML parser. BTW, I tried the SAX parser and that failed - maloney, Feb 14 '13 at 15:01

score 1 · Answer 1 · answered Feb 18 '13 at 12:07

From your example it seems that your content model is very simple. In that case you could implement a simple ad hoc parser.

If you are very sure that marked sections are not used (not only because of CDATA sections, but also because the status keyword could be given in parameter entities, which would further complicate everything), and that esoteric features of SGML (such as DATATAG) are not being used, you could just remove all comments and then scan for the following pattern:

(?i)<f\s+id\s*=\s*["']?title["']?\s*>

This leaves you at the beginning of the content, assuming that f has a single attribute, id (and that the start-tag is not minimized, since it could be unclosed or net-enabling). Then scan until the next "<", and voilà.
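As a rough illustration, the steps above (strip comments, find the start-tag, read up to the next "<") could be sketched in plain Java as follows. The class name SgmlFieldScanner and the method extractField are made up for this example, and the comment pattern only handles the simple <!-- ... --> form. It relies on the same assumptions as above: no marked sections, no minimized tags, and id as the only attribute of f.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: an ad hoc scanner for
// simple documents like the sample in the question.
final class SgmlFieldScanner {

    // Assumption: comments only appear in the simple <!-- ... --> form.
    // DOTALL lets a comment span several lines.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Returns the content of the first <f id=...> element with the given id,
    // or an empty Optional if no such start-tag is found.
    static Optional<String> extractField(String sgml, String id) {
        // Step 1: remove comments so commented-out tags cannot match.
        String text = COMMENT.matcher(sgml).replaceAll("");

        // Step 2: the pattern from the answer, with the id value parameterized.
        Pattern startTag = Pattern.compile(
                "(?i)<f\\s+id\\s*=\\s*[\"']?" + Pattern.quote(id) + "[\"']?\\s*>");
        Matcher m = startTag.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }

        // Step 3: the content runs from the end of the start-tag to the next "<".
        int start = m.end();
        int end = text.indexOf('<', start);
        if (end < 0) {
            end = text.length();
        }
        // Trimming is a convenience added here, not part of the original recipe.
        return Optional.of(text.substring(start, end).trim());
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "<story>",
                "<fields>",
                "<f id=title>Sports</f>",
                "<f id=modify-by>Tester</f>",
                "</fields>");
        System.out.println(extractField(sample, "title").orElse("(not found)"));
    }
}
```

Run against the sample from the question, main prints Sports. In practice the file contents would be read into a string first, for example with Files.readString on Java 11 or later.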

Of course, something more flexible certainly requires an SGML parser.
