
I have two XML files containing a "ß" ("scharfes S" in German). One is a UTF-8 file, the other starts with:

    <?xml version="1.0" encoding="utf-16" standalone="yes"?>

I used the following code to read the UTF-8 file:

    import xmltodict

    # The file name must be a string literal
    with open('file.xml', encoding='utf-8') as file:
        f = file.read()
        xml = xmltodict.parse(f)

and this code for the UTF-16 file:

    with open('file.xml', encoding='utf-16') as file:
        f = file.read()
        xml = xmltodict.parse(f)

For the UTF-16 file I get this error: UnicodeError: UTF-16 stream does not start with BOM. Changing the code to:

    import os

    with open('file.xml', encoding='utf-16') as file:
        file.seek(1, os.SEEK_SET)  # attempt to skip past the start of the file
        f = file.read()
        xml = xmltodict.parse(f)

where I tried different offsets (e.g. seek(1, ...), seek(2, ...), ...), doesn't help.

Then I checked the encoding with vim:

    alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
    vic file.xml
    > latin-1

Therefore I replaced encoding='utf-16' with encoding='latin-1'.

But now I get errors about the "ß" in the file (e.g. when trying "utf-16-le"):

  "'utf-16-le' codec can't decode bytes in position 12734-12735: illegal encoding"

Does anyone know what the problem is here? Or, more generally: how can I read XML files in Python with UTF-8 or UTF-16 encoding without getting BOM errors or errors about the character "ß"?

Thank you in advance!

wuiwuiwui

2 Answers


If I create a UTF-16LE file:

    $ echo 'Character is: ß' | iconv -t utf-16le >f.txt

and examine it with a hex dump:

    $ xxd f.txt
    00000000: 4300 6800 6100 7200 6100 6300 7400 6500  C.h.a.r.a.c.t.e.
    00000010: 7200 2000 6900 7300 3a00 2000 df00 0a00  r. .i.s.:. .....

and then read it in Python:

    >>> open('f.txt', encoding='utf-16LE').read()
    'Character is: ß\n'

then I get the expected results.

Your file is not correctly encoded with the encoding that you're declaring.

  "can't decode bytes in position 12734-12735: illegal encoding"

Create a much smaller sample file, or generate one as suggested above and look for differences.
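
To look for differences without leaving Python, something along these lines may help. It is only a sketch: `sniff_bom` and `hex_context` are hypothetical helper names, not part of any library, and the BOM table deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```python
import codecs

# Byte-order marks to look for, paired with the codec each one implies.
# Assumption: UTF-32 is not considered here.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def sniff_bom(data: bytes):
    """Return the codec implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def hex_context(data: bytes, pos: int, width: int = 8) -> str:
    """Hex-dump the bytes around offset pos, e.g. the position from a decode error."""
    start = max(0, pos - width)
    return ' '.join(f'{b:02x}' for b in data[start:pos + width])


# Same content as the iconv example above: UTF-16LE without a BOM.
sample = 'Character is: ß\n'.encode('utf-16-le')

print(sniff_bom(sample))                        # no BOM present
print(sniff_bom(codecs.BOM_UTF16_LE + sample))  # BOM present
print(hex_context(sample, 28, 2))               # bytes around the "ß"
```

For the real file, reading it with `open('file.xml', 'rb').read()` and passing 12734 as `pos` would show the bytes the decoder complained about.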

Joe

If you find yourself messing with the file encoding manually when handling XML files, you're doing something wrong.

Fundamental rule: Never read XML files with open() in text mode.

Use an XML parser to load the file. The parser will sort out the encoding for you automatically. That's the whole point of having an XML declaration like <?xml version="1.0" encoding="utf-16"?> at the top of the file.

    import xml.etree.ElementTree as ET

    tree = ET.parse('file.xml')

If you want to use xmltodict, open the file in binary mode (rb):

    with open('file.xml', 'rb') as f:
        xml = xmltodict.parse(f)

Here, xmltodict will give the file to an XML parser internally, which again will sort out the encoding for you.
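
As a rough illustration of the parser handling the encoding itself, the sketch below feeds raw bytes to the standard library's ElementTree. The two in-memory documents and the helper name `root_text` are made up for this example; the UTF-16 one is encoded with Python's 'utf-16' codec, which prepends a BOM.

```python
import xml.etree.ElementTree as ET


def root_text(xml_bytes: bytes) -> str:
    """Parse raw XML bytes and return the text of the root element."""
    return ET.fromstring(xml_bytes).text


# Two hypothetical documents with the same content but different encodings.
utf8_doc = '<?xml version="1.0" encoding="utf-8"?><root>ß</root>'.encode('utf-8')
# Encoding with the 'utf-16' codec writes a BOM in front of the data.
utf16_doc = '<?xml version="1.0" encoding="utf-16"?><root>ß</root>'.encode('utf-16')

# No encoding is passed anywhere: the parser works it out from the bytes.
print(root_text(utf8_doc))
print(root_text(utf16_doc))
```

Both calls should give back the same "ß", which is the behaviour you lose when the file is decoded by hand before it reaches the parser.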


If the above mangles characters or even throws errors, your XML file is broken. Fix the producer of the file. If you've edited the file manually, double check that your text editor's encoding settings match the XML declaration.

Tomalak
  • Thank you very much! I found a solution by using Joe's solution. But your answer still helps me very much. I'll definitely try to use a parser and see how it goes! – wuiwuiwui Nov 26 '20 at 13:08
  • @wuiwuiwui The point is that you still will mess up the file when you `open()` it as `encoding='utf-16LE'` and it happens to be something else. So don't use Joe's solution. Use the second suggestion in my answer. – Tomalak Nov 26 '20 at 13:12
  • Thank you, I'll definitely do! – wuiwuiwui Nov 26 '20 at 14:38