18

When I use xmltodict to load the xml file below I get an error: xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Here is my file:

<?xml version="1.0" encoding="utf-8"?>
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>

Source:

import xmltodict
with open('fileTEST.xml') as fd:
   xmltodict.parse(fd.read())

I am on Windows 10, using Python 3.6 and xmltodict 0.11.0

If I use ElementTree it works

import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='fileTEST.xml')
for elem in tree.iter():
    print(elem.tag, elem.attrib)

mydocument {'has': 'an attribute'}
and {}
many {}
many {}
plus {'a': 'complex'}

Note: I might have encountered a new line problem.
Note 2: I used Beyond Compare on two different files.
It crashes on the file that is encoded as UTF-8 with BOM, and works on the plain UTF-8 file.
UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.
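
One way to check whether a file starts with that byte sequence is to read its first three bytes in binary mode. This is only a minimal sketch using the standard library. `has_utf8_bom` and `_write_temp` are hypothetical helper names, and the demo writes its own temporary files instead of touching fileTEST.xml.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as fd:
        return fd.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def _write_temp(data):
    """Write bytes to a temporary file and return its path (demo helper)."""
    handle, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(handle, "wb") as out:
        out.write(data)
    return path


xml_bytes = b'<?xml version="1.0" encoding="utf-8"?><mydocument/>'
with_bom = _write_temp(codecs.BOM_UTF8 + xml_bytes)
without_bom = _write_temp(xml_bytes)

bom_found = has_utf8_bom(with_bom)
bom_missing = has_utf8_bom(without_bom)
print(bom_found, bom_missing)

# Clean up the demo files.
os.remove(with_bom)
os.remove(without_bom)
```

Reading in binary mode matters here: in text mode the bytes are decoded first, so what you see depends on the encoding that open() picked.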

Damian
  • What is the exact traceback you get? I just tried doing what you showed, and it worked correctly for me, with either bytes or unicode (Python 3 string) as input. – cco Feb 16 '18 at 09:57
  • The error is: xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1 – Damian Feb 17 '18 at 01:45

7 Answers

17

I think you forgot to specify the encoding. I suggest parsing the XML file with ElementTree first, serializing it to a string variable, and passing that to xmltodict:

import xml.etree.ElementTree as ET
import xmltodict
import json


tree = ET.parse('your_data.xml')
xml_data = tree.getroot()
# Here you can change the encoding to whichever one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')

data_dict = dict(xmltodict.parse(xmlstr))
  • oddly, this worked for me without changing from the default encoding type, which is set to `'utf-8'` on both `ET` and `xmltodict` – Dave Kielpinski Feb 28 '20 at 18:22
5

In my case the file was being saved with a byte order mark (BOM), as is the default with Notepad++.

I resaved the file without the BOM to plain utf8.
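
If you would rather do the resave from Python than from an editor, something like the following should work. It is a sketch under assumptions: `resave_without_bom` is a hypothetical helper name, the file is assumed to be valid UTF-8, and the demo runs on a throwaway file in a temporary directory.

```python
import codecs
import os
import tempfile


def resave_without_bom(src_path, dst_path):
    """Copy a UTF-8 file, dropping a leading BOM if there is one."""
    # "utf-8-sig" consumes a leading BOM on read; plain "utf-8" never writes one.
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8-sig", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a throwaway file that starts with a BOM.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "with_bom.xml")
dst = os.path.join(tmp_dir, "no_bom.xml")
with open(src, "wb") as fd:
    fd.write(codecs.BOM_UTF8 + b"<mydocument/>")

resave_without_bom(src, dst)

with open(dst, "rb") as fd:
    result = fd.read()
print(result)
```

The "utf-8-sig" codec also reads files that have no BOM, so the helper is safe to run on either kind of file.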

jmunsch
4

Python 3

One Liner

data: dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))

Helper for .json and .xml

I wrote a small helper function to load .json and .xml files from a given path. I thought it might come in handy for some people here:

import json
from xml.etree import ElementTree

import xmltodict

def load_json(path: str) -> dict:  
    if path.endswith(".json"):
        print(f"> Loading JSON from '{path}'")
        with open(path, mode="r") as open_file:
            content = open_file.read()

        return json.loads(content)
    elif path.endswith(".xml"):
        print(f"> Loading XML as JSON from '{path}'")
        xml = ElementTree.tostring(ElementTree.parse(path).getroot())
        return xmltodict.parse(xml, attr_prefix="@", cdata_key="#text", dict_constructor=dict)

    print(f"> Loading failed for '{path}'")
    return {}

Notes

  • if you want to get rid of the @ and #text markers in the json output, use the parameters attr_prefix="" and cdata_key=""

  • normally xmltodict.parse() returns an OrderedDict but you can change that with the parameter dict_constructor=dict

Usage

path = "my_data.xml"
data = load_json(path)
print(json.dumps(data, indent=2))

# OUTPUT
#
# > Loading XML as JSON from 'my_data.xml' 
# {
#   "mydocument": {
#     "@has": "an attribute",
#     "and": {
#       "many": [
#         "elements",
#         "more elements"
#       ]
#     },
#     "plus": {
#       "@a": "complex",
#       "#text": "element as well"
#     }
#   }
# }


winklerrr
2

I had the same problem and solved it just by passing the encoding to the open function.

In this case it would be something like:

import xmltodict
with open('fileTEST.xml', encoding='utf8') as fd:
   xmltodict.parse(fd.read())
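
A possible explanation for why this helps: without an encoding argument, open() falls back to the locale's preferred encoding, which on Windows is often cp1252 (that is an assumption about the asker's machine, not something confirmed in the question). The sketch below uses only the standard library and shows how the same BOM bytes come out under three codecs.

```python
import codecs

# A tiny document that starts with the UTF-8 BOM, as raw bytes.
raw = codecs.BOM_UTF8 + b"<mydocument/>"

as_cp1252 = raw.decode("cp1252")        # BOM bytes become three stray characters
as_utf8 = raw.decode("utf-8")           # BOM becomes the single character U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")   # BOM is consumed, nothing is left of it

print(repr(as_cp1252[:3]), repr(as_utf8[0]), repr(as_utf8_sig[0]))
```

With cp1252 the text handed to the parser starts with three characters that are not part of the XML, which would be consistent with an error right at the start of line 1. If you want the BOM removed at read time, encoding='utf-8-sig' is the codec that does that.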
Travia
1

In my case, the issue was with the first 3 characters. So removing them worked:

import xmltodict
from xml.parsers.expat import ExpatError

with open('your_data.xml') as f:
    data = f.read()
    try:
        doc = xmltodict.parse(data)
    except ExpatError:
        doc = xmltodict.parse(data[3:])
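
A slightly more defensive variant is to drop the prefix only when it really is a BOM, instead of slicing off three characters after any ExpatError. This is just a sketch: `strip_decoded_bom` is a hypothetical helper, and it assumes the three characters are the UTF-8 BOM bytes decoded as cp1252.

```python
import codecs

# The BOM as it appears after a utf-8 decode, and after a mistaken cp1252 decode.
BOM_CHAR = "\ufeff"
BOM_AS_CP1252 = codecs.BOM_UTF8.decode("cp1252")  # three characters


def strip_decoded_bom(text):
    """Remove a leading BOM from already-decoded text, if one is present."""
    for prefix in (BOM_CHAR, BOM_AS_CP1252):
        if text.startswith(prefix):
            return text[len(prefix):]
    return text


clean_from_utf8 = strip_decoded_bom("\ufeff<mydocument/>")
clean_from_cp1252 = strip_decoded_bom(BOM_AS_CP1252 + "<mydocument/>")
untouched = strip_decoded_bom("<mydocument/>")
print(clean_from_utf8, clean_from_cp1252, untouched)
```

Note that if the file was decoded with the wrong codec, any non-ASCII characters further down are still garbled, so reading with the right encoding is the cleaner fix.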
Prayson W. Daniel
0

xmltodict seems to not be able to parse <?xml version="1.0" encoding="utf-8"?>

If you remove this line, it works.

Arount
  • Obviously it would work without the xml declaration; did it fail with it? For me, it didn't fail. – cco Feb 16 '18 at 10:00
  • For me it failed with xml declaration, but not without it.. You know, I try things before posting it here – Arount Feb 16 '18 at 10:02
  • I'm not trying to throw shade. I got a different result, and it wasn't clear to me that you had tried with the declaration. What versions are you using? I'm using Python 3.5.2, Expat 2.1.1 – cco Feb 16 '18 at 10:08
  • Python: `3.6.3`, xmltodict: `0.11.0`. – Arount Feb 16 '18 at 10:49
0

Not specific to the original post but for those who are also running into the same error at a different line, I was able to fix it by correcting the XML/XHTML error.

In my case, the document I was working with had a text description containing a bare ampersand symbol "&" instead of the escaped form "&amp;", so to fix my issue I had to edit the file before running it through the parser.
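
If you generate the XML text yourself, the standard library can do the escaping for you. A small sketch, with a made-up description string, showing that the raw ampersand is rejected and the escaped one parses:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "Research & Development"  # made-up sample text

# A bare "&" in element content is not well-formed XML.
try:
    ET.fromstring("<description>" + raw_text + "</description>")
    raw_parsed = True
except ET.ParseError:
    raw_parsed = False

# escape() replaces &, < and > with their entity references.
escaped = escape(raw_text)
element = ET.fromstring("<description>" + escaped + "</description>")
print(raw_parsed, escaped, element.text)
```

escape() is meant for text content only. Running it over a whole document would also escape the angle brackets of the tags themselves.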

Brian