I am trying to parse the PostHistory.xml file from the Stack Exchange data dump. My code looks like this:
import xml.etree.ElementTree as eTree
with open("PostHistory.xml", 'r') as xml_file:
    xml_tree = eTree.parse(xml_file)
But I get:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 1959-1960: invalid continuation byte
I can read text from the file like this:
with open("PostHistory.xml") as xml_file:
    a = xml_file.readline()
The file * command returns this description for the file:
PostHistory.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text,
with very long lines, with CRLF line terminators
Also, the first line of the file declares UTF-8 encoding:
<?xml version="1.0" encoding="utf-8"?>
I tried adding the parameter encoding="utf-8-sig", but I got the same error again.
The size of the file is 112 GB. Am I missing something here?
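In case it helps to narrow this down, here is a minimal sketch for locating the first invalid byte sequence without loading the whole file into memory. It assumes the file is really meant to be UTF-8. The function name find_first_invalid_utf8 and its chunk_size and context parameters are made up for illustration. It reads the file in binary mode in fixed-size chunks and feeds them to Python's incremental UTF-8 decoder, so a multi-byte character that happens to be split across two chunks is not reported as an error.

```python
import codecs


def find_first_invalid_utf8(path, chunk_size=1 << 20, context=16):
    """Return (byte_offset, surrounding_bytes) for the first invalid
    UTF-8 sequence in the file, or None if the whole file decodes.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    offset = 0  # number of bytes read from the file before this chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # Bytes of an incomplete character carried over from the
            # previous chunk; the decoder prepends them to this chunk.
            pending = decoder.getstate()[0]
            try:
                # final=True on the empty last read flushes the buffer,
                # so a truncated trailing sequence is reported too.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as err:
                # err.start is relative to pending + chunk (err.object).
                bad_at = offset - len(pending) + err.start
                lo = max(err.start - context, 0)
                return bad_at, err.object[lo:err.end + context]
            if not chunk:
                return None
            offset += len(chunk)


if __name__ == "__main__":
    print(find_first_invalid_utf8("PostHistory.xml"))
```

For a clean file this should return None; otherwise it returns the absolute byte offset of the first bad sequence together with a few surrounding bytes, which can then be inspected to see what encoding they might actually be in.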