
A server I don't control sends broken XML with unescaped characters such as '>', '&', and '<' in attributes and text.

A small sample:

<StockFormula Description="" Name="F_ΔTURN" RankType="Higher" Scope="Universe" Weight="10.86%">
    <Formula>AstTurnTTM>AstTurnPTM</Formula>
</StockFormula>
<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">
</Composite>

I settled on using the lxml module because it is case sensitive, is very fast and gets the job done.

How would I go about fixing up this type of XML? Basically, I am trying to replace all occurrences of the invalid characters with the proper escape sequences. Here is my attempt so far:

import re

broken = '<StockFormula Description="" Name="F_ΔTURN" RankType="Higher" Scope="Universe" Weight="10.86%">\n<Formula>AstTurnTTM>AstTurnPTM</Formula>\n<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">\n</Composite>'
# Attempt: escape a bare '&' inside a Name attribute
print(re.sub(r'(.*Name=".*)&(")', r'\g<1>&amp;\g<2>', broken))

Output:

<StockFormula Description="" Name="F_ÃŽâ€TURN" RankType="Higher" Scope="Universe" Weight="10.86%">
    <Formula>AstTurnTTM>AstTurnPTM</Formula>
</StockFormula>
<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">
</Composite>
ChaimG

1 Answer


First, realize no XML parser can help you with "broken XML." XML parsers only operate over XML, which by definition must be well-formed.
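To see this concretely, here is a minimal sketch using the standard library's `xml.etree.ElementTree` (not lxml, which the question uses). The helper name `is_well_formed` is made up for illustration. A conforming parser rejects the `Composite` element from the question outright and accepts it only once the attribute value is escaped.

```python
import xml.etree.ElementTree as ET

# One element from the question, with a raw '&' and '<' in the attribute value.
broken = '<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%"></Composite>'

# The same element with the attribute value properly escaped.
escaped = '<Composite Name="Piotroski &amp; Trends - &lt;11@4w600k 70b" Weight="0%"></Composite>'


def is_well_formed(text):
    """Return True if the parser accepts `text` as a complete XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(broken))
print(is_well_formed(escaped))
```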

Second, it is not possible to repair "broken XML" in the general case. There are no rules governing "broken XML," so without a clear definition of how the input is broken, nothing guarantees that you can process it and convert it to real XML.

That said, HTML Tidy does a decent job of repairing (X)HTML, and it has limited capabilities for repairing XML as well. It's your best bet for automated repair of "broken XML." There is a Python package, PyTidyLib, which wraps the HTML Tidy library.
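If the breakage is limited to what the question's sample shows (raw `&`, `<`, and `>` inside double-quoted attribute values and element text), a narrow regex pass may be enough. The sketch below is a heuristic under those assumptions and not a general repair tool. It uses only the standard library. The names `repair`, `escape_text`, and the three patterns are hypothetical, and so is the `<Root>` wrapper, which is added only because the sample has no single root element. It treats anything shaped like `<Name attr="value" ...>` as a real tag and escapes everything else.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic: a "real" tag is <Name attr="v" ...>, </Name>, or <Name ... />,
# with every attribute value in double quotes.
TAG_RE = re.compile(r'<(/?)([A-Za-z_][\w.:-]*)((?:\s+[\w.:-]+\s*=\s*"[^"]*")*)\s*(/?)>')
ATTR_RE = re.compile(r'([\w.:-]+)\s*=\s*"([^"]*)"')
# An ampersand that does not already start an entity or character reference.
BARE_AMP_RE = re.compile(r'&(?!(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z_][\w.-]*);)')


def escape_text(s):
    """Escape bare '&', then every '<' and '>'."""
    s = BARE_AMP_RE.sub('&amp;', s)
    return s.replace('<', '&lt;').replace('>', '&gt;')


def repair(broken):
    """Rebuild the input, escaping attribute values and the text between tags."""
    out = []
    pos = 0
    for m in TAG_RE.finditer(broken):
        out.append(escape_text(broken[pos:m.start()]))  # text before this tag
        slash, name, attrs, selfclose = m.groups()
        fixed_attrs = ''.join(
            ' {}="{}"'.format(k, escape_text(v)) for k, v in ATTR_RE.findall(attrs)
        )
        out.append('<{}{}{}{}>'.format(slash, name, fixed_attrs, selfclose))
        pos = m.end()
    out.append(escape_text(broken[pos:]))  # trailing text after the last tag
    return ''.join(out)


sample = (
    '<StockFormula Description="" Name="F_ΔTURN" RankType="Higher" Scope="Universe" Weight="10.86%">\n'
    '    <Formula>AstTurnTTM>AstTurnPTM</Formula>\n'
    '</StockFormula>\n'
    '<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">\n'
    '</Composite>'
)

fixed = repair(sample)
# Hypothetical wrapper element so the repaired fragment has a single root.
root = ET.fromstring('<Root>' + fixed + '</Root>')
print(fixed)
```

The heuristic is ambiguous whenever broken text itself looks like a tag (for example `A<B>C` inside an element), so check the result against real server output before relying on it.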

kjhughes
  • I installed PyTidyLib using pip. But when I tried to run the [Small example of use](http://countergram.com/open-source/pytidylib/docs/index.html) I got AttributeError: `'Tidy' object has no attribute '_tidy'` – ChaimG Sep 28 '16 at 01:46
  • Carefully follow the instructions for [**Installing HTML Tidy**](http://countergram.com/open-source/pytidylib/docs/index.html#installing-html-tidy). If that doesn't help, I have no further guidance to offer. Perhaps ask a new question about getting PyTidyLib to work, or try using HTML Tidy directly without PyTidyLib. Good luck. – kjhughes Sep 28 '16 at 02:37