When I parse HTML with libxml2's htmlParseDoc in Python, the text comes back transformed. For example,

Orig HTML: Order Number & OrderID & Was Approved

Becomes: Order Number &amp; OrderID &amp; Was Approved

Additional non-visible control characters are also transformed, so replacing "&amp;" with the original character does not make the before and after strings equal. (I checked this by dumping the strings in hex format.)

Does anyone know how to either 1) prevent the transformation from occurring or 2) apply a transformation in the other direction to recover the original?

Thanks in advance.

DaBler
AnonPyDev
  • You could use [htmlParser](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) – fredtantini Jan 16 '15 at 15:25
  • Thank you. htmlParser with decode and then lowercasing the string works for two of my cases. I will post updates with the remaining issues. – AnonPyDev Jan 16 '15 at 15:57
  • One should note that `A & B` isn't valid HTML in the first place. Besides that have a look [here](http://stackoverflow.com/questions/17423495/how-to-solve-ampersand-conversion-issue-in-xml). – dhke Aug 01 '16 at 14:18
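
The entity-decoding approach suggested in the comments can be sketched roughly as follows. This is only an illustration, assuming Python 3 and the standard-library `html.unescape` rather than anything libxml2 provides; the helper names `restore_entities` and `first_difference` are hypothetical, and the `serialized` string is an assumed example of parser output, not captured from a real run.

```python
import html


def restore_entities(serialized: str) -> str:
    """Decode character references such as &amp; back to literal characters."""
    return html.unescape(serialized)


def first_difference(a: str, b: str):
    """Return the index where two strings first differ, or None if they are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One string is a prefix of the other.
        return min(len(a), len(b))
    return None


original = "Order Number & OrderID & Was Approved"
# Assumed form of the text after the parser has escaped the bare ampersands.
serialized = "Order Number &amp; OrderID &amp; Was Approved"

restored = restore_entities(serialized)
idx = first_difference(original, restored)
if idx is None:
    print("strings match after decoding")
else:
    # Show the characters around the mismatch in escaped form, which makes
    # non-visible control characters easy to spot.
    print(idx, ascii(original[idx:idx + 5]), ascii(restored[idx:idx + 5]))
```

Decoding entities only undoes the escaping of characters like `&`. If the strings still differ, `first_difference` points at the first position to inspect, which may help narrow down the control-character changes described in the question.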

0 Answers