I am currently using SAX to parse some HTML. However, I now have to parse a document that contains something like this:

`<OPTION VALUE="123" SELECTED>`

Because SELECTED has no value set, the parser throws an error (not well-formed, invalid token). Is there a way to resolve this so that I can keep using SAX?

My code:

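        // Obtain the platform's default SAX parser and its XMLReader.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader xr = sp.getXMLReader();

        // sch is my ContentHandler; the input is fetched from the URL.
        xr.setContentHandler(sch);
        InputSource is = new InputSource(Statics.SUBJECT_CODE_URL);
        xr.parse(is);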

2 Answers

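You can't use a SAX XML parser to parse HTML. HTML is not XML: a perfectly valid HTML document is not necessarily a well-formed XML document (a minimized attribute such as SELECTED is legal HTML but illegal XML), and nothing you can configure will make a conforming XML parser accept it.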
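
To illustrate the point, here is a minimal sketch that feeds two fragments to the JDK's default SAX parser. The class name `WellFormedCheck` and the helper `isWellFormed` are hypothetical names made up for this example, and closing tags were added so that the attribute is the only difference between the two inputs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the question's code.
class WellFormedCheck {
    // Returns true if the default SAX parser accepts the markup as well-formed XML.
    static boolean isWellFormed(String markup) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // A well-formedness violation is reported as a fatal SAXParseException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML-style minimized attribute: rejected by the XML parser.
        System.out.println(isWellFormed("<OPTION VALUE=\"123\" SELECTED></OPTION>")); // false
        // XML-style attribute with an explicit value: accepted.
        System.out.println(isWellFormed("<OPTION VALUE=\"123\" SELECTED=\"SELECTED\"></OPTION>")); // true
    }
}
```

The only difference between the two inputs is whether SELECTED carries a value, which is exactly the token the parser in the question complains about.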

Jim Garrison

With SAX you can parse XHTML, but you cannot reliably parse HTML, because HTML is generally not well-formed XML.
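
As a rough sketch of the XHTML case, the handler below reads the value of the selected option from a well-formed fragment. The class `SelectedOptionDemo`, the method `selectedValue`, and the sample markup are assumptions made for illustration; the sketch relies on the default (non-namespace-aware) factory, so element and attribute names are matched by their qualified names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the question's code.
class SelectedOptionDemo {
    // Returns the value attribute of the last <option> that carries a selected attribute,
    // or null if there is none.
    static String selectedValue(String xhtml) throws Exception {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("option".equals(qName) && attrs.getValue("selected") != null) {
                    result[0] = attrs.getValue("value");
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<select>"
                + "<option value=\"122\">A</option>"
                + "<option value=\"123\" selected=\"selected\">B</option>"
                + "</select>";
        System.out.println(selectedValue(xhtml)); // 123
    }
}
```

This works only because the input is already well-formed: XHTML writes the attribute as `selected="selected"` instead of the bare `SELECTED` form.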

Konstantin Yovkov