I am having trouble parsing XML from a string directly into an Element. I have an XML file that I have turned into a string:

resp = requests.post(request_url, request_string, proxies=urllib.getproxies(), stream=True)

As recommended here: https://stackoverflow.com/a/25023776/1551810, I used the content instead of the text:

response_tree = ET.fromstring(resp.content)

Apparently there is a syntax error in the XML file:

XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x20 0x4E 0x6F, line 12, column 35

I tried encoding the content like this, but to no avail:

ET.fromstring(resp.content.encode('utf8'))

I get the same XMLSyntaxError as before. Can anyone help me? I have already spent two hours on this.

Saltigué
  • The comment says that it is *not* UTF-8, so you need to figure out which encoding the data is actually in, and then either transcode (decode from that encoding, then encode into UTF-8) or specify a proper XML header – deets May 18 '15 at 10:33
  • Thanks for your quick answer. The data is a string and I decoded it. I now get a UnicodeDecodeError. Could you please elaborate on your idea? – Saltigué May 18 '15 at 10:38
  • Very similar to this PHP problem: http://stackoverflow.com/questions/2507608/error-input-is-not-proper-utf-8-indicate-encoding-using-phps-simplexml-lo; as @deets suggested, you need to get your encoding in order – seanhodges May 18 '15 at 10:55
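
The transcoding step suggested in the comments can be sketched as follows. This is a minimal illustration, not the asker's actual data: the payload `raw` is hypothetical, and ISO-8859-1 is an assumed source encoding, chosen because the bytes in the error message (0xB0 0x20 0x4E 0x6F) read as "° No" in that encoding. The sketch uses the standard library's `xml.etree.ElementTree` rather than lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: no encoding declaration, and it contains the byte
# sequence from the error message (0xB0 0x20 0x4E 0x6F).
raw = b"<root><item>12\xb0 No</item></root>"

# A lone 0xB0 is not a valid UTF-8 sequence, so a strict decode fails.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Assumption: the data is really ISO-8859-1. Decode from that encoding,
# then re-encode as UTF-8 before handing the bytes to the parser.
utf8_bytes = raw.decode("iso-8859-1").encode("utf-8")
root = ET.fromstring(utf8_bytes)
item_text = root.find("item").text
```

Calling `.encode('utf8')` directly on the original bytes cannot work, because the data first has to be decoded from the encoding it is really in.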

1 Answer

I finally found a great library that helped me solve the problem: cchardet (https://pypi.python.org/pypi/cchardet/0.3.5). I then followed @deets' advice:

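import cchardet

charac_coding_desired = 'UTF-8'
# Guess the actual encoding of the raw response bytes
encoding = cchardet.detect(resp.content)['encoding']
# Default to the raw bytes so strg is always defined
strg = resp.content
if encoding and encoding.upper() != charac_coding_desired:
    # Decode from the detected encoding, then re-encode as UTF-8
    strg = resp.content.decode(encoding).encode(charac_coding_desired)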

Now I can parse the string directly:

ET.fromstring(strg)
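
When the encoding is already known, an alternative to transcoding is to tell the parser which encoding to use. The sketch below uses the standard library's `xml.etree.ElementTree.XMLParser`, whose `encoding` argument overrides whatever the document declares; lxml's `etree.XMLParser` accepts an `encoding` argument as well. The payload `raw` and the choice of ISO-8859-1 are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical ISO-8859-1 payload without an encoding declaration.
raw = b"<root><item>12\xb0 No</item></root>"

# Override the encoding at the parser level instead of transcoding the bytes.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(raw, parser=parser)
item_text = root.find("item").text
```

This skips the detection step, so it only fits cases where the source encoding is fixed and known in advance.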

Thanks anyway!!!

Saltigué
  • Good work solving the problem and posting back your results. After a short delay you should now be able to mark your own answer as solved. Please do this and you will help other people who are having the same problem. – seanhodges May 18 '15 at 16:48