
I am trying to parse the PostHistory.xml file from the stack exchange dump. My code looks like that:

import xml.etree.ElementTree as eTree
with open("PostHistory.xml", 'r') as xml_file:
    xml_tree = eTree.parse(xml_file)

But I get:

UnicodeDecodeError: 'utf-8' codec can't decode 
bytes in position 1959-1960: invalid continuation byte

I can read the text of the file like that:

with open("PostHistory.xml") as xml_file:
     a = xml_file.readline()

The `file` command returns this description for the file:

PostHistory.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, 
with very long lines, with CRLF line terminators

Also the first line of the file confirms the UTF-8 encoding:

<?xml version="1.0" encoding="utf-8"?>

I tried to add the parameter `encoding="utf-8-sig"` but I got the same error again.

The size of the file is 112 GB. Am I missing something here?

Anoroah
  • If the XML file is 112GB, you should use [`iterparse()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse) instead. I also wouldn't open the file first; just use the path/filename in iterparse. – Daniel Haley Mar 15 '19 at 15:24
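Following that comment, here is a minimal streaming sketch using `iterparse()`. It assumes (as in the Stack Exchange data dumps) that each record is a `<row>` element; note it avoids loading the file into memory but will still hit the same decode error when it reaches the corrupt bytes:

```python
import xml.etree.ElementTree as eTree

def iter_rows(path):
    """Yield each <row> element from the dump one at a time,
    without building the whole tree in memory."""
    for event, elem in eTree.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield elem
            elem.clear()  # free the element once the caller has used it
```

Usage: `for row in iter_rows("PostHistory.xml"): print(row.get("Id"))`.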

2 Answers


You can try something like this:

    import xml.etree.ElementTree as eTree

    with open(posts_path) as xml_file:
        for line in xml_file:
            try:
                xml_obj = eTree.fromstring(line)
            except UnicodeDecodeError:
                # Re-encode the corrupted line, dropping characters
                # that cannot be represented in Latin-1
                new_str = line.encode("latin-1", "ignore")
                xml_obj = eTree.fromstring(new_str)

So when a line contains invalid characters, it is re-encoded as Latin-1 and the characters that cannot be represented are dropped before parsing.
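Note that with a text-mode file object, the `UnicodeDecodeError` is actually raised while iterating the lines, before `fromstring` ever runs. An alternative sketch is to let `open()` substitute the bad bytes up front with `errors="replace"` (each undecodable byte becomes U+FFFD); the `parse_lines` helper and the assumption of one `<row>` record per line are mine, not from the dump's documentation:

```python
import xml.etree.ElementTree as eTree

def parse_lines(path):
    """Parse one <row> fragment per line; errors="replace" swaps
    undecodable bytes for U+FFFD so iteration never raises."""
    objs = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line.startswith("<row"):
                objs.append(eTree.fromstring(line))
    return objs
```

This trades silent data substitution for robustness, so it is best paired with logging which lines were affected.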


The reality of the file's bytes may contradict the encoding specified in the XML declaration. (Just setting the encoding in the XML declaration won't change the rest of the bytes in the file.)

You can try

open("PostHistory.xml", 'r', encoding="ISO-8859-1")

but you may have to roll up your sleeves and repair the errant bytes at position 1959-1960 if it's data corruption rather than a file-wide encoding problem.
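To see what is actually sitting at that offset, you can read the raw bytes around it in binary mode, which bypasses decoding entirely. A small sketch (the `bytes_around` helper is hypothetical; the offset comes from the traceback):

```python
def bytes_around(path, offset, context=10):
    """Return the raw bytes surrounding a file offset, for eyeballing
    whether the problem is one corrupt byte or a wrong encoding."""
    with open(path, "rb") as f:  # binary mode: no decoding at all
        f.seek(max(offset - context, 0))
        return f.read(2 * context)

# Example: print(bytes_around("PostHistory.xml", 1959))
```

If the surrounding bytes look like valid Latin-1 text, it is an encoding mismatch; a lone stray byte in otherwise valid UTF-8 points to corruption.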


kjhughes