There is an HTTP proxy server running on a Raspberry Pi 3. The XML data is parsed with BeautifulSoup (BS). I found that when the data contained only ASCII characters, BS was super fast. However, when some of the characters were beyond ASCII, BS became extremely slow (for a 150 KB XML string, it takes more than 10 seconds). I also tried ElementTree and xml.dom; both were slow. xml.sax was much better, but with my Python 2.7.13 on the Raspberry board, xml.sax could only deal with ASCII. I have to call data.encode('ascii', 'ignore') before using sax, but this also took long. I am just wondering: is there a good way to deal with a UTF-8 XML string?
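The ASCII workaround described above can be sketched like this (Python 3 syntax; the stripped-down handler is illustrative and just collects character data). Note that `'ignore'` silently drops every non-ASCII character, which is exactly the data loss the question is trying to avoid:

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Minimal handler that accumulates character data."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.text = []

    def characters(self, content):
        self.text.append(content)

data = u'<root>caf\u00e9</root>'
handler = TextCollector()
# Strip non-ASCII bytes before parsing, as described above
xml.sax.parseString(data.encode('ascii', 'ignore'), handler)
print(''.join(handler.text))  # -> caf  (the é has been dropped)
```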
Viewed 1,137 times
-1
- No expert in Python, but `xml.sax.parseString(data.decode('utf-8'))` could perhaps work? You decode to Unicode instead. Also watch this presentation: https://www.youtube.com/watch?v=Mx70n1dL534 – Niloct Apr 07 '17 at 22:53
- `sax.parseString` is a helper function and is different from `parser.parse()`. It cannot support UTF-8 input. This can be seen in the `__init__.py` in the sax folder: it imports `StringIO` from `cStringIO`, which cannot deal with Unicode. – Yu Xuan Apr 08 '17 at 13:37
- http://stackoverflow.com/questions/1817695/python-how-to-get-stringio-writelines-to-accept-unicode-string – Niloct Apr 08 '17 at 16:40
1 Answer
1
To parse an "xml" response with BS:
import requests
from bs4 import BeautifulSoup

response = requests.post(url)
soup = BeautifulSoup(response.text, 'xml')
The
response.text
property decodes the response body automatically and returns it as a string. However, if the server does not declare a charset, requests has to guess the encoding before it can decode, and that detection pass is what takes the time. (Presumably pure-ASCII content is recognized almost immediately, which would explain why ASCII responses were decoded fast.)
Set
response.encoding = 'utf-8'
before accessing
response.text
to tell requests how to decode the response content. It will go much faster.
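A minimal sketch of the fix. To keep it runnable offline, a Response object is built by hand here; `_content` is an internal requests attribute, used only to fake the network reply:

```python
import requests

# Fake a response carrying UTF-8 bytes and no charset header
resp = requests.models.Response()
resp._content = u'<status>caf\u00e9</status>'.encode('utf-8')

# Without this line, resp.text would fall back to slow charset detection
resp.encoding = 'utf-8'
print(resp.text)  # -> <status>café</status>
```

With a real `requests.post(url)` response, the only change needed is the single `response.encoding = 'utf-8'` assignment before the first access to `response.text`.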
