5

I have a number of XML files I'd like to process with a script, converting them from whatever encoding that they're in to UTF-8.

Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?

For example, I have many files which are already in UTF-8, which should be left alone:

<?xml version="1.0" encoding="utf-8"?>

However, I have a lot of files which do need to be converted:

<?xml version="1.0" encoding="windows-1255"?>

How can I detect the XML encoding specified in the headers of these files in Python? Better, after I detect and reencode the files, how then can I change this XML header to read "utf-8" to avoid processing it in the future?

Community
  • 1
  • 1
Naftuli Kay
  • 87,710
  • 93
  • 269
  • 411

3 Answers3

5

Use lxml to do the parsing; you can then access the original encoding with:

from lxml import etree

with open(filename, 'r') as xmlfile:
    tree = etree.parse(xmlfile)
    if tree.docinfo.encoding == 'utf-8':
        # already in correct encoding, abort
        return

You can then use lxml to write the file out again in UTF-8.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Is there a way to do this with Python's standard library? – Piotr Dobrogost Feb 19 '16 at 12:37
  • @PiotrDobrogost: You could try to use the [expat parser](https://docs.python.org/2/library/pyexpat.html#module-xml.parsers.expat), but it'll be a lot more work if you need to re-code whole documents. – Martijn Pieters Feb 19 '16 at 13:45
  • I was interested only in getting encoding declaration itself. However according to [docs](https://docs.python.org/2/library/pyexpat.html#xml.parsers.expat.ParserCreate) *Expat doesn’t support as many encodings as Python does, and its repertoire of encodings can’t be extended; it supports UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII.* and I need to process documents with ISO-8859-2 encoding. The truth is Python still does not have adequate tools for XML in the standard library. – Piotr Dobrogost Feb 19 '16 at 14:11
  • Thinking more about this the fact that Expat does not support parsing of ISO-8859-2 encoded XML file does not necessarily mean it can't handle getting encoding declaration alone as this is SAX parser and the xml declaration comes first in the file. As it turns out it is possible – see my [answer](http://stackoverflow.com/a/35508885/95735). – Piotr Dobrogost Feb 19 '16 at 15:29
2

How can I detect the XML encoding specified in the headers of these files in Python?

A solution by Rob Wolfe using only the standard library:

from xml.parsers import expat

s = """<?xml version='1.0' encoding='iso-8859-1'?>
       <book>
           <title>Title</title>
           <chapter>Chapter 1</chapter>
       </book>"""

class MyParser(object):
    def XmlDecl(self, version, encoding, standalone):
        print "XmlDecl", version, encoding, standalone

    def Parse(self, data):
        Parser = expat.ParserCreate()
        Parser.XmlDeclHandler = self.XmlDecl
        Parser.Parse(data, 1)

parser = MyParser()
parser.Parse(s)
Linkid
  • 517
  • 5
  • 17
Piotr Dobrogost
  • 41,292
  • 40
  • 236
  • 366
  • 1
    But, as stated in your comments to my answers, expat doesn't actually *support* decoding codecs other than UTF-8, UTF-16, ISO-8859-1, and by extension, ASCII. So don't try to use this to *recode* documents in other codecs, because you'll get a Mojibake instead. – Martijn Pieters Feb 19 '16 at 15:32
  • @MartijnPieters Right, this is only for reading encoding declaration. – Piotr Dobrogost Feb 19 '16 at 15:35
0

I want to expand on @PiotrDobrogost's answer and actually write a class that retrieves XML document encoding:

from xml.parsers import expat

class XmlParser(object):
    '''class used to retrive xml documents encoding
    '''

    def get_encoding(self, xml):
        self.__parse(xml)
        return self.encoding

    def __xml_decl_handler(self, version, encoding, standalone):
        self.encoding = encoding

    def __parse(self, xml):
        parser = expat.ParserCreate()
        parser.XmlDeclHandler = self.__xml_decl_handler
        parser.Parse(xml)

And here is the example of its usage:

xml = """<?xml version='1.0' encoding='iso-8859-1'?>
    <book>
        <title>Title</title>
        <chapter>Chapter 1</chapter>
    </book>"""
parser = XmlParser()
encoding = parser.get_encoding(xml)
Mykhailo Seniutovych
  • 3,527
  • 4
  • 28
  • 50