A server I don't control sends broken XML with characters such as '>', '&', '<' etc. in attributes and text.
A small sample:
<StockFormula Description="" Name="F_ΔTURN" RankType="Higher" Scope="Universe" Weight="10.86%">
<Formula>AstTurnTTM>AstTurnPTM</Formula>
</StockFormula>
<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">
</Composite>
I settled on using the lxml module because it is case sensitive, is very fast and gets the job done.
How would I go about fixing up this type of XML? Basically, I am trying to replace all occurrences of the invalid characters with the proper escape sequences.
import re
broken = '<StockFormula Description="" Name="F_ΔTURN" RankType="Higher" Scope="Universe" Weight="10.86%">\n<Formula>AstTurnTTM>AstTurnPTM</Formula>\n<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">\n</Composite>'
print re.sub(r'(.*Name=".*)&(")', r'\g<1>>\g<2>', broken)
Output:
<StockFormula Description="" Name="F_ÃŽâ€TURN" RankType="Higher" Scope="Universe" Weight="10.86%">
<Formula>AstTurnTTM>AstTurnPTM</Formula>
</StockFormula>
<Composite Name="Piotroski & Trends - <11@4w600k 70b" Weight="0%" RankType="Higher">
</Composite>