
I am trying to parse an XML document using a SAX parser, but I keep getting the error "XML document structures must start and end within the same entity." This is expected, because the XML I receive from another source is not complete. However, I don't want this exception to be raised: I would like to parse the document only until I find <myTag>, and I don't care whether the document has proper opening and closing tags.

Example:

<employeeDetails>
  <firstName>xyz</firstName>
  <lastName>orp</lastName>
  <departmentDetails>
  <departName>SALES</departName>
  <departCode>982</departCode>...

Here I don't want to care whether the document is valid, as that part is out of my hands. I would like to parse this document until I see <departName> and stop parsing after that. Please suggest how to do this. Thanks.

kjhughes
user1653027

2 Answers


You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)

By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.

Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
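As a rough illustration of the "parse it as a text file" route, here is a minimal Python sketch that scans the raw text for the first occurrence of a tag with a regular expression. The function name find_tag_text and the sample string are hypothetical, and the approach assumes the element contains plain text only (no CDATA sections, comments, or nested elements with the same name), so treat it as a starting point rather than a robust solution.

```python
import re


def find_tag_text(text, tag):
    """Return the text of the first <tag>...</tag> pair in text, or None.

    This is plain text scanning, not XML parsing: it ignores CDATA,
    comments, entities, and nesting of same-named elements.
    """
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical truncated input, modelled on the question's example.
TRUNCATED = """<employeeDetails>
  <firstName>xyz</firstName>
  <lastName>orp</lastName>
  <departmentDetails>
  <departName>SALES</departName>
  <departCode>982</departCode>
"""

if __name__ == "__main__":
    print(find_tag_text(TRUNCATED, "departName"))
```

Because the search stops at the first match, nothing after the closing tag of the target element is examined, so the missing end of the document does not matter here.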

kjhughes
  • Thanks for the information. Good to know that we can't parse incomplete XML using a parser. I will try to do it the traditional parsing way, as you said. Thanks. – user1653027 Jan 12 '15 at 19:38

BeautifulSoup in Python handles incomplete XML really well. I use it to parse the prefix of large XML files for previews. (The 'xml' parser used below requires the lxml package to be installed.)

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>
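If only the standard library is available, the same prefix-only idea can be sketched with Python's xml.sax, on the assumption that everything before the target tag is well-formed. A SAX parser reports events as it reads, so a handler can abort by raising its own exception before the parser reaches the truncated end. The names _Found, FirstElementHandler, read_first, and the sample bytes are hypothetical, and a malformed tag earlier in the input would still stop the parse with an error.

```python
import xml.sax


class _Found(Exception):
    """Raised by the handler to abort parsing once the target is read."""

    def __init__(self, text):
        super().__init__(text)
        self.text = text


class FirstElementHandler(xml.sax.ContentHandler):
    """Collect the text of the first element named target, then stop."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.inside = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == self.target:
            self.inside = True

    def characters(self, content):
        if self.inside:
            self.chunks.append(content)

    def endElement(self, name):
        if self.inside and name == self.target:
            raise _Found("".join(self.chunks))


def read_first(data, target):
    """Return the text of the first target element, or None."""
    try:
        xml.sax.parseString(data, FirstElementHandler(target))
    except _Found as found:
        return found.text
    except xml.sax.SAXParseException:
        # The parser hit the malformed or truncated part first.
        return None
    return None


# Hypothetical truncated input, modelled on the question's example.
TRUNCATED = b"""<employeeDetails>
  <firstName>xyz</firstName>
  <lastName>orp</lastName>
  <departmentDetails>
  <departName>SALES</departName>
  <departCode>982</departCode>
"""

if __name__ == "__main__":
    print(read_first(TRUNCATED, "departName"))
```

Unlike BeautifulSoup, this does not repair the document. It simply stops reading before the point where the parser would complain.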