I am parsing XML data with lxml in Python.
The data looks like this:
string='''<?xml version="1.0" encoding="UTF-8"?>/n
<div type="request" xml:base="/k-api/7728" xml:lang="en" >
<div n="" type="request" xml:id="_54f59d0003">
<p xml:id="_54f59d0004"/>
<p xml:id="_54f59d0005">Requests </p>
</div>
<div n="0001" type="request" xml:id="_54f59d0006">
<p xml:id="_54f59d0007">1. First request.
</p>
</div>
<div n="0002" type="claim" xml:id="_54f59d0008">
<p xml:id="_54f59d0009">2. Second request.
</p>
</div>
<div n="0003" type="request" xml:id="_54f59d0010">
<p xml:id="_54f59d0011">3. Thrid requests.
</p>
</div>
<div n="0004" type="request" xml:id="_54f59d0012">
<p xml:id="_54f59d0013">4. request.
</p>
</div>
</div>'''
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(encoding="UTF-8", resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(string, parser=parser)
This does not work, for several reasons. a) The line break \n: I can solve that with
xml_string = ''.join(string.splitlines())
but I am wondering whether there is a way to tell the parser that lxml should ignore line breaks. b) The UTF-8 encoding declaration on the first line of the string. I can also take care of that with:
xml_string = xml_string.replace('<?xml version="1.0" encoding="UTF-8"?>','')
before parsing, but is there a way to do it all inside the lxml parser, i.e. telling the parser to remove line breaks and to ignore the encoding declaration? (Note: encoding="UTF-8" or encoding=None does not solve the problem.)
Thanks
EDIT 1: The error that I get when not removing the encoding declaration is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
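The error message in EDIT 1 itself points at one option: pass bytes rather than a str. Below is a minimal sketch, assuming the data arrives as a Python str and that encoding it as UTF-8 matches the declaration. The shortened sample and the names xml_string and root are illustrative only. remove_blank_text=True is an existing XMLParser option that discards whitespace-only text nodes between tags (heuristically when no DTD is present), which may cover the line-break concern. Line breaks between elements are ordinary whitespace in XML and do not by themselves prevent parsing.

```python
from lxml import etree

# Shortened, illustrative version of the sample data (assumption: a str,
# with the XML declaration at the very start of the string).
xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<div type="request" xml:base="/k-api/7728" xml:lang="en">
<div n="0001" type="request" xml:id="_54f59d0006">
<p xml:id="_54f59d0007">1. First request.
</p>
</div>
</div>'''

parser = etree.XMLParser(
    resolve_entities=False,
    strip_cdata=False,
    recover=True,
    ns_clean=True,
    remove_blank_text=True,  # drop whitespace-only text between tags
)

# Encoding to bytes lets lxml honor the declaration instead of rejecting it.
root = etree.fromstring(xml_string.encode("utf-8"), parser=parser)

print(root.tag, root.get("type"), len(root))
```

With this approach neither the splitlines() join nor the replace() call should be needed, though that is worth confirming against the real data.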