I am parsing XML data with lxml in Python.

The data looks like this:

string='''<?xml version="1.0" encoding="UTF-8"?>/n
    <div type="request" xml:base="/k-api/7728" xml:lang="en" >
    <div n="" type="request" xml:id="_54f59d0003">
        <p xml:id="_54f59d0004"/>
        <p xml:id="_54f59d0005">Requests </p>
    </div>
    <div n="0001" type="request" xml:id="_54f59d0006">
        <p xml:id="_54f59d0007">1.  First request.
        </p>
    </div>
    <div n="0002" type="claim" xml:id="_54f59d0008">
         <p xml:id="_54f59d0009">2. Second request.
         </p>
    </div>
    <div n="0003" type="request" xml:id="_54f59d0010">
         <p xml:id="_54f59d0011">3. Thrid requests.
         </p>
    </div>
    <div n="0004" type="request" xml:id="_54f59d0012">
        <p xml:id="_54f59d0013">4. request.
        </p>
    </div>
</div>'''


import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(encoding="UTF-8", resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(xml_string,parser=parser)

This does not work, for several reasons. a) The line break \n: I can solve that with

xml_string = ''.join(string.splitlines())

but I am wondering whether there is a way to tell the parser that lxml should ignore line breaks.

b) The UTF-8 encoding declaration on the first line of the string. I can also take care of it with:

xml_string = xml_string.replace('<?xml version="1.0" encoding="UTF-8"?>','')

before parsing, but is there a way to do it all inside the lxml parser, i.e. telling the parser to remove line breaks and to ignore the encoding declaration? (Note: encoding="UTF-8" or encoding=None does not solve the problem.)

Thanks

EDIT 1: The error that I get when not removing the encoding declaration is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

JFerro
    Your code works fine for me, using triple quotes around `string` and `XML_tree = etree.fromstring(string.encode('utf-8'), parser=parser)` – Maurice Meyer Mar 23 '21 at 11:27
    Have a look at https://stackoverflow.com/questions/28534460/lxml-etree-xml-valueerror-for-unicode-string re the encoding. – yvesonline Mar 23 '21 at 11:46

1 Answer

etree.fromstring() needs the XML input encoded as bytes to parse correctly when the string includes an XML declaration with an encoding.

Alternatively, you can use the ElementTree.fromstring() function.

import xml.etree.ElementTree as ET
from lxml import etree

xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<div...>
</div>'''

parser = etree.XMLParser(encoding="UTF-8", resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)

# Option 1
root = etree.fromstring(xml_string.encode('utf-8'), parser)

# Option 2
root = ET.fromstring(xml_string, parser)

# do something with the parsed XML

pretty_xml = etree.tostring(root, pretty_print=True, encoding=str)
print(pretty_xml)
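
As a minimal, self-contained sketch of Option 1 (assuming lxml is installed, and using a shortened copy of the question's data rather than the full document), the following shows that the str input is rejected while the bytes input parses, and that the line breaks inside the string need no pre-processing. The name `str_input_rejected` is only illustrative.

```python
from lxml import etree

# Shortened sample based on the question's data (assumption: the real
# document has the same shape, with the declaration at the very start).
xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<div type="request" xml:lang="en">
    <div n="0001" type="request" xml:id="_54f59d0006">
        <p xml:id="_54f59d0007">1.  First request.
        </p>
    </div>
    <div n="0002" type="claim" xml:id="_54f59d0008">
        <p xml:id="_54f59d0009">2. Second request.
        </p>
    </div>
</div>'''

# A str that carries an encoding declaration is what triggers the
# ValueError quoted in the question.
try:
    etree.fromstring(xml_string)
    str_input_rejected = False
except ValueError:
    str_input_rejected = True

# Encoding to bytes first lets the parser read the declaration itself.
root = etree.fromstring(xml_string.encode("utf-8"))

# Line breaks are ordinary whitespace in XML, so no splitlines() step is
# needed; strip() here only tidies the text for display.
paragraphs = [p.text.strip() for p in root.iter("p")]
print(str_input_rejected, paragraphs)
```

With the bytes input there should be no need to strip the declaration or join the lines before parsing.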
CodeMonkey