8

I am reading an XML file using Python, but the file contains unescaped & characters, so running my code gives the following error:

xml.parsers.expat.ExpatError: not well-formed (invalid token):

Is there a way to make Python ignore the & check?

Anne
SyncMaster
  • 1
    possible duplicate of [How do I escape ampersands in XML?](http://stackoverflow.com/questions/1328538/how-do-i-escape-ampersands-in-xml) – James Black Nov 14 '11 at 00:12
  • @James: not really, since the question is about how to parse something that's almost but not quite XML, not how to create XML properly in the first place. – Wooble Nov 14 '11 at 00:14
  • 3
    Do you have control over whatever abomination is creating the original "XML" file so you can make it actually give you valid XML? – Wooble Nov 14 '11 at 00:14
  • Then the XML file actually is not well-formed, and no conforming XML parser should parse it. Can't you fix the source to produce actual XML? – svick Nov 14 '11 at 00:14
  • Unfortunately it is not a well formed xml file. It is a text file with tags. So I thought accessing it in the form of xml file would be easier to process the data. – SyncMaster Nov 14 '11 at 00:22
  • @pragadheesh - So just replace all the ampersands with the three ampersand replacements as in the question I mentioned, then do it as XML. – James Black Nov 14 '11 at 01:53
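The replacement suggested in the comment above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the helper name `escape_bare_ampersands`, the `BARE_AMP` pattern, and the sample input are all made up for illustration, and the pattern only treats `&` as already escaped when it is followed by something shaped like an entity or character reference.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: an "&" that does NOT already start something shaped
# like a named entity (&amp;) or a character reference (&#169; / &#xA9;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace bare ampersands with &amp; and leave existing references alone."""
    return BARE_AMP.sub("&amp;", text)

# Made-up sample input mixing a bare "&" with already-escaped references.
raw = "<items><item>Tom & Jerry</item><item>fish &amp; chips &#169;</item></items>"

root = ET.fromstring(escape_bare_ampersands(raw))
texts = [item.text for item in root.findall("item")]
```

Note that this leaves anything that merely looks like a named entity untouched, so an undefined name such as `&nbsp;` would still be rejected by the parser. It also does nothing about other well-formedness problems the file may have.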

2 Answers

8

No, you can't ignore the check. Your 'xml file' is not an XML file - to be an XML file, the ampersand would have to be escaped. Therefore, no software that is designed to read XML files will parse it without error. You need to correct the software that generated this file so that it generates proper ("well-formed") XML. All the benefits of using XML for interchange disappear entirely if people start sending stuff that isn't well-formed and people receiving it try to patch it up.
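To illustrate the point, here is a small sketch using the standard library's `xml.etree.ElementTree`, which is an assumption since the question does not say which parsing module is in use. The sample strings are made up. The unescaped ampersand is rejected, while the same text escaped with `xml.sax.saxutils.escape` parses and round-trips to the original characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up sample: the bare "&" makes this document not well-formed.
bad = "<note>Tom & Jerry</note>"
try:
    ET.fromstring(bad)
    error = None
except ET.ParseError as exc:
    # ElementTree reports the underlying expat error as ParseError.
    error = exc

# What a correct generator would emit: character data escaped before
# it is placed inside the element.
good = "<note>" + escape("Tom & Jerry") + "</note>"
text = ET.fromstring(good).text
```

Escaping belongs on the producing side: once the data has been mixed with markup, a consumer can no longer tell reliably which ampersands were meant literally.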

Michael Kay
  • 5
    This just is not right. 1) There is a lot of software that does parse such a file: any internet browser does, as do IDEs like Xcode. 2) You cannot ask people to fix the software that produces the XML, because in the general case it's third-party software. – LiMar Jun 24 '13 at 09:22
  • There may be software products that can parse such files, but such a software product is not an XML parser. Conformant XML parsers are required to report all errors in XML files. Internet browsers, as far as I am able to establish, correctly reject a file served as XML if it contains an unescaped ampersand. – Michael Kay Jun 24 '13 at 13:23
  • 3
    And when software is generating bad XML, then fixing it is the right solution. Generating bad XML is the same as generating a proprietary format of your own invention - there's no point in adopting a standard and then not implementing it properly. – Michael Kay Jun 24 '13 at 13:26
2

For me, adding the line "<?xml version='1.0' encoding='iso-8859-1'?>" at the front of the string did the trick.

>>> import xml.etree.ElementTree as ET
>>> # bytes literal, so the parser applies the declared iso-8859-1 encoding
>>> text = b'''<?xml version="1.0" encoding="iso-8859-1"?>
... <seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'''
>>> doc = ET.fromstring(text)

See this page: https://mail.python.org/pipermail/tutor/2006-November/050757.html

Kumar