xml.parsers.expat.ExpatError: not well-formed (invalid token)

Question

When I use xmltodict to load the xml file below I get an error: xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Here is my file:

<?xml version="1.0" encoding="utf-8"?>
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>

Source:

import xmltodict
with open('fileTEST.xml') as fd:
   xmltodict.parse(fd.read())

I am on Windows 10, using Python 3.6 and xmltodict 0.11.0

If I use ElementTree it works

import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='fileTEST.xml')
for elem in tree.iter():
    print(elem.tag, elem.attrib)

mydocument {'has': 'an attribute'}
and {}
many {}
many {}
plus {'a': 'complex'}

Note: I might have encountered a newline problem.
Note 2: I used Beyond Compare on two different files.
It crashes on the file that is encoded as UTF-8 with a BOM, and works on the plain UTF-8 file.
The UTF-8 BOM is a sequence of bytes (EF BB BF) at the start of a file that allows the reader to identify the file as being encoded in UTF-8.
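
To check whether a given file carries the BOM without a diff tool, it should be enough to look at its first three bytes. A minimal sketch; the helper name has_utf8_bom is hypothetical and not part of xmltodict or the standard library:

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return raw.startswith(codecs.BOM_UTF8)


# For a real file, read in binary mode so nothing is decoded before the check:
# with open('fileTEST.xml', 'rb') as fd:
#     print(has_utf8_bom(fd.read()))
print(has_utf8_bom(b'\xef\xbb\xbf<?xml version="1.0"?><a/>'))  # True
print(has_utf8_bom(b'<?xml version="1.0"?><a/>'))              # False
```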

What is the exact traceback you get? I just tried doing what you showed, and it worked correctly for me, with either bytes or unicode (Python 3 string) as input. — cco, Feb 16 '18 at 09:57
The error is: xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1 — Damian, Feb 17 '18 at 01:45

score 17 · Accepted Answer · answered Aug 27 '19 at 09:28

I think you forgot to define the encoding type. I suggest that you try to initialize that xml file to a string variable:

import xml.etree.ElementTree as ET
import xmltodict
import json


tree = ET.parse('your_data.xml')
xml_data = tree.getroot()
# change the encoding here if you need a different one
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')

data_dict = dict(xmltodict.parse(xmlstr))

Renz Paul Del Rosario


oddly, this worked for me without changing from the default encoding type, which is set to `'utf-8'` on both `ET` and `xmltodict` – Dave Kielpinski Feb 28 '20 at 18:22

score 5 · Answer 2 · answered Jun 06 '18 at 00:03

In my case the file was being saved with a byte order mark (BOM), as is the default with Notepad++.

I resaved the file as plain UTF-8 without the BOM.
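
If resaving by hand in an editor is not practical, the same thing can probably be done from Python by dropping the first three bytes when they match the BOM. A minimal sketch; resave_without_bom is a hypothetical helper name, and it rewrites the file in place, so try it on a copy first:

```python
import codecs
from pathlib import Path


def resave_without_bom(path: str) -> bool:
    """Rewrite the file in place without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    file = Path(path)
    raw = file.read_bytes()
    if not raw.startswith(codecs.BOM_UTF8):
        return False  # no BOM, leave the file untouched
    file.write_bytes(raw[len(codecs.BOM_UTF8):])
    return True
```

Working on raw bytes avoids any dependence on the platform's default text encoding.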

jmunsch


winklerrr · Answer 3 · 2020-02-22T12:02:10.793

Python 3

One Liner

# requires: import xmltodict; from xml.etree import ElementTree
data: dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))

Helper for `.json` and `.xml`

I wrote a small helper function to load .json and .xml files from a given path. I thought it might come in handy for some people here:

import json
from xml.etree import ElementTree

import xmltodict


def load_json(path: str) -> dict:
    """Load a .json or .xml file from path and return its content as a dict."""
    if path.endswith(".json"):
        print(f"> Loading JSON from '{path}'")
        with open(path, mode="r") as open_file:
            content = open_file.read()

        return json.loads(content)
    elif path.endswith(".xml"):
        print(f"> Loading XML as JSON from '{path}'")
        # ElementTree reads the file as bytes, so a UTF-8 BOM does not trip up the parser
        xml = ElementTree.tostring(ElementTree.parse(path).getroot())
        return xmltodict.parse(xml, attr_prefix="@", cdata_key="#text", dict_constructor=dict)

    # Unsupported file extension
    print(f"> Loading failed for '{path}'")
    return {}

Notes

If you want to get rid of the @ and #text markers in the JSON output, use the parameters attr_prefix="" and cdata_key="".
Normally xmltodict.parse() returns an OrderedDict, but you can change that with the parameter dict_constructor=dict.

Usage

path = "my_data.xml"
data = load_json(path)
print(json.dumps(data, indent=2))

# OUTPUT
#
# > Loading XML as JSON from 'my_data.xml' 
# {
#   "mydocument": {
#     "@has": "an attribute",
#     "and": {
#       "many": [
#         "elements",
#         "more elements"
#       ]
#     },
#     "plus": {
#       "@a": "complex",
#       "#text": "element as well"
#     }
#   }
# }

score 2 · Answer 4 · answered Jul 27 '22 at 04:20

I had the same problem and solved it just by passing the encoding to the open function.

In this case it would be something like:

import xmltodict
with open('fileTEST.xml', encoding='utf8') as fd:
   xmltodict.parse(fd.read())
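
A likely reason this helps: on Windows, open() without an encoding argument often falls back to a legacy code page such as cp1252, which turns the three BOM bytes into three junk characters in front of the XML declaration. The sketch below reproduces that with the standard library's expat parser only. It assumes that string input gets encoded to UTF-8 before it reaches expat (which is what I believe xmltodict does); parses_ok is a hypothetical helper:

```python
from xml.parsers import expat


def parses_ok(text: str) -> bool:
    """Encode text to UTF-8, feed it to expat, and report whether it is well-formed."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text.encode("utf-8"), True)
        return True
    except expat.ExpatError:
        return False


# Bytes of a file saved as "UTF-8 with BOM"
raw = b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><mydocument/>'

as_cp1252 = raw.decode("cp1252")  # the BOM becomes three junk characters
as_utf8 = raw.decode("utf-8")     # the BOM becomes the single character U+FEFF

print(repr(as_cp1252[:3]), parses_ok(as_cp1252))  # junk prefix, parse fails
print(repr(as_utf8[:1]), parses_ok(as_utf8))      # U+FEFF prefix, parse succeeds
```

With the UTF-8 decoding, the leading U+FEFF is encoded back into the original BOM bytes, which expat recognizes and skips.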

Travia


score 1 · Answer 5 · answered May 28 '19 at 12:53

In my case, the issue was with the first 3 characters. So removing them worked:

import xmltodict
from xml.parsers.expat import ExpatError

with open('your_data.xml') as f:
    data = f.read()
    try:
        doc = xmltodict.parse(data)
    except ExpatError:
        doc = xmltodict.parse(data[3:])
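
Slicing off three characters only fits the case where the BOM was decoded into exactly three characters, as happens with a code page like cp1252. Decoded as UTF-8 it is a single character, so data[3:] would also cut into the document. A possibly more robust variant is Python's built-in utf-8-sig codec, which drops a leading BOM if present and otherwise behaves like utf-8. A minimal sketch; decode_xml is a hypothetical helper name:

```python
def decode_xml(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if there is one."""
    return raw.decode("utf-8-sig")


with_bom = b'\xef\xbb\xbf<a>text</a>'
without_bom = b'<a>text</a>'

print(decode_xml(with_bom))                             # <a>text</a>
print(decode_xml(with_bom) == decode_xml(without_bom))  # True
```

The same codec name can be passed to open, for example open('your_data.xml', encoding='utf-8-sig').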

score 0 · Answer 6 · answered Feb 16 '18 at 09:57

xmltodict seems to not be able to parse <?xml version="1.0" encoding="utf-8"?>

If you remove this line, it works.

Arount


Obviously it would work without the xml declaration; did it fail with it? For me, it didn't fail. – cco Feb 16 '18 at 10:00
For me it failed with the XML declaration, but not without it. You know, I try things before posting them here – Arount Feb 16 '18 at 10:02
I'm not trying to throw shade. I got a different result, and it wasn't clear to me that you had tried with the declaration. What versions are you using? I'm using Python 3.5.2, Expat 2.1.1 – cco Feb 16 '18 at 10:08
Python: `3.6.3`, xmltodict: `0.11.0`. – Arount Feb 16 '18 at 10:49

score 0 · Answer 7 · answered Jan 17 '22 at 20:38

Not specific to the original post but for those who are also running into the same error at a different line, I was able to fix it by correcting the XML/XHTML error.

In my case, the document I was working with had a text description containing a bare ampersand "&" instead of "&amp;", so to fix my issue I had to edit the file before running it through the parser.
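
To illustrate the ampersand case: a bare "&" in element text is not well-formed XML, while the escaped form parses. The sketch below uses only the standard library; is_well_formed is a hypothetical helper, and the sample text is made up:

```python
from xml.parsers import expat
from xml.sax.saxutils import escape


def is_well_formed(xml: str) -> bool:
    """Return True if expat can parse the whole document without an error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(xml, True)
        return True
    except expat.ExpatError:
        return False


description = "Research & Development"  # made-up sample text

print(is_well_formed(f"<desc>{description}</desc>"))          # False
print(is_well_formed(f"<desc>{escape(description)}</desc>"))  # True
print(escape(description))                                    # Research &amp; Development
```

Note that escape is meant for text that has not been escaped yet; running it over a file that already contains entities would escape them a second time.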
