34

I'm transforming an XML document with XSLT. When I run the script with Python 3 I get the following error, but the same script runs without errors under Python 2.

-> % python3 cstm/artefact.py
Traceback (most recent call last):
  File "cstm/artefact.py", line 98, in <module>
    simplify_this_dataset('fisheries-service-des-peches.xml')
  File "cstm/artefact.py", line 85, in simplify_this_dataset
    xslt_root = etree.XML(xslt_content)
  File "lxml.etree.pyx", line 3012, in lxml.etree.XML (src/lxml/lxml.etree.c:67861)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102420)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

#!/usr/bin/env python3
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# -*- coding: utf-8 -*-

import os

from lxml import etree

def simplify_this_dataset(dataset):
    """Create a simplified version of an XML file.

    It removes all the attributes and assigns them as elements instead.
    """
    module_path = os.path.dirname(os.path.abspath(__file__))
    data = open(module_path+'/data/ex-fire.xslt')
    xslt_content = data.read()
    xslt_root = etree.XML(xslt_content)
    dom = etree.parse(module_path+'/../CanSTM_dataset/'+dataset)
    transform = etree.XSLT(xslt_root)
    result = transform(dom)
    f = open(module_path+ '/../CanSTM_dataset/otra.xml', 'w')
    f.write(str(result))
    f.close()
Papouche Guinslyzinho

3 Answers

50
data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()

In Python 3, opening the file in text mode implicitly decodes its bytes to Unicode text, using the platform's default encoding. (This can give wrong results if the XML file isn't in that encoding.)
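To make the implicit decoding visible, here is a minimal standard-library sketch. The sample document and the file name `sample.xml` are made up for illustration, and UTF-8 is passed explicitly to stand in for a typical default encoding; your platform's default may differ.

```python
import os
import tempfile

# Hypothetical sample: a document saved as ISO-8859-1, as its declaration says.
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><root>café</root>'.encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode: Python decodes the bytes before any XML parser sees them.
    # The byte 0xE9 ("é" in ISO-8859-1) is not valid UTF-8 here, so this fails.
    try:
        with open(path, encoding="utf-8") as f:
            text_result = f.read()
    except UnicodeDecodeError:
        text_result = "UnicodeDecodeError"

    # Binary mode: the bytes come back untouched, declaration included.
    with open(path, "rb") as f:
        byte_result = f.read()

print(text_result)
print(byte_result == raw)
```

With a different mismatch the text-mode read might not fail at all and could silently produce wrong characters instead, which is harder to notice.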

xslt_root = etree.XML(xslt_content)

XML has its own handling and signalling for encodings: the <?xml version="1.0" encoding="..."?> declaration at the start of the document. If you pass a Unicode string starting with such a declaration to a parser, the parser would like to reinterpret the rest of the input using that encoding... but can't, because you've already decoded the byte input to a Unicode string.
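The difference can be reproduced in a few lines. This is a small sketch, assuming lxml is installed; the empty stylesheet and the helper `try_parse` are hypothetical and exist only for the demonstration.

```python
from lxml import etree

# Hypothetical, minimal stylesheet text that carries an encoding declaration.
XSLT_TEXT = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>'
)


def try_parse(source):
    """Return the root tag, or a short description of the ValueError raised."""
    try:
        return etree.XML(source).tag
    except ValueError as exc:
        return "ValueError: " + str(exc)


# str input with an encoding declaration is rejected by lxml.
str_result = try_parse(XSLT_TEXT)

# The same content as bytes is accepted; the parser reads the declaration itself.
bytes_result = try_parse(XSLT_TEXT.encode("utf-8"))

print(str_result)
print(bytes_result)
```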

Instead, you should either pass the undecoded byte string to the parser:

with open(module_path+'/data/ex-fire.xslt', 'rb') as data:
    xslt_content = data.read()
xslt_root = etree.XML(xslt_content)

or, better, just have the parser read straight from the file:

xslt_root = etree.parse(module_path+'/data/ex-fire.xslt')
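Put together, the whole transformation can let lxml read both files directly. The following is only a sketch under stated assumptions: `transform_file`, the identity stylesheet, the sample data and the temporary file names are all hypothetical and are not taken from the question's project.

```python
import os
import tempfile

from lxml import etree


def transform_file(xslt_path, xml_path, out_path):
    """Apply the stylesheet at xslt_path to xml_path and write the result."""
    # etree.parse opens the files itself and reads raw bytes, so each
    # document's own XML declaration decides how it is decoded.
    transform = etree.XSLT(etree.parse(xslt_path))
    result = transform(etree.parse(xml_path))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(str(result))


# Hypothetical inputs: an identity stylesheet and a tiny document.
IDENTITY_XSLT = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""
SAMPLE_XML = '<?xml version="1.0" encoding="UTF-8"?><data><item>pêche</item></data>'.encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    xslt_path = os.path.join(tmp, "identity.xslt")
    xml_path = os.path.join(tmp, "input.xml")
    out_path = os.path.join(tmp, "output.xml")
    for path, content in ((xslt_path, IDENTITY_XSLT), (xml_path, SAMPLE_XML)):
        with open(path, "wb") as f:
            f.write(content)

    transform_file(xslt_path, xml_path, out_path)

    with open(out_path, encoding="utf-8") as f:
        output_text = f.read()

print(output_text)
```

Passing the parsed tree straight to `etree.XSLT` avoids holding the stylesheet as a Python string at all, so the str-versus-bytes question never comes up.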
Wodin
bobince
12

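You can also decode the UTF-8 byte string and re-encode it as ASCII before passing it to etree.XML (this assumes data.read() returned bytes, i.e. the file was opened in binary mode, and that the content contains only ASCII characters):

xslt_content = data.read()
xslt_content = xslt_content.decode('utf-8').encode('ascii')
xslt_root = etree.XML(xslt_content)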
Josh Allemon
    Why would you encode it as ascii when the initial declaration suggests utf-8 possibility? – sandyp Jan 16 '18 at 05:09
    I made it work by simply reencoding with the default options: xslt_content = data.read().encode() – Loki Dec 06 '19 at 14:38
9

I made it work by simply re-encoding with the default options (UTF-8 in Python 3):

xslt_content = data.read().encode()
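One caveat with this approach: `str.encode()` produces UTF-8 by default, which is only safe when the document's declaration also says UTF-8 (or the content is plain ASCII). The sketch below illustrates the mismatch with a made-up document. It uses the standard-library `xml.etree.ElementTree` as a stand-in so it runs without third-party packages; lxml would be expected to trust the declaration in the same way, but that is not verified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document text whose declaration names ISO-8859-1.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><root>café</root>'

# Default re-encoding produces UTF-8 bytes, contradicting the declaration.
reencoded = text.encode()

# Encoding to the declared charset keeps bytes and declaration consistent.
matching = text.encode("iso-8859-1")

# The parser trusts the declaration, so the UTF-8 bytes are misread.
garbled = ET.fromstring(reencoded).text
correct = ET.fromstring(matching).text

print(garbled)
print(correct)
```

Reading the file in binary mode, or letting the parser open the file itself, sidesteps this because the bytes are never re-encoded.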
Loki