There is an HTTP proxy server running on a Raspberry Pi 3. The XML data is parsed with BeautifulSoup (BS). I found that when the data contained only ASCII characters, BS was super fast. However, when some of the characters were beyond ASCII, BS became extremely slow (for a 150 KB XML string, it takes more than 10 seconds). I also tried ElementTree and xml.dom; both were slow. xml.sax was much better, but with my Python 2.7.13 on the Raspberry board, xml.sax could only deal with ASCII. I have to call data.encode('ascii', 'ignore') before using sax, but this also took long. I am just wondering: is there a good way to deal with a UTF-8 XML string?
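The ASCII workaround described above can be sketched like this (Python 3 syntax; the stripped-down handler is illustrative and just collects character data). Note that `'ignore'` silently drops every non-ASCII character, which is exactly the data loss the question is trying to avoid:

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Minimal handler that accumulates character data."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.text = []

    def characters(self, content):
        self.text.append(content)

data = u'<root>caf\u00e9</root>'
handler = TextCollector()
# Strip non-ASCII bytes before parsing, as described above
xml.sax.parseString(data.encode('ascii', 'ignore'), handler)
print(''.join(handler.text))  # -> caf  (the é has been dropped)
```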
Viewed 1,137 times
-1
- No expert in Python, but `xml.sax.parseString(data.decode('utf-8'))` could perhaps work? You decode to Unicode instead. Also watch this presentation: https://www.youtube.com/watch?v=Mx70n1dL534 – Niloct Apr 07 '17 at 22:53
- `sax.parseString` is a helper function and is different from `parser.parse()`. It cannot support UTF-8 input. This can be seen in the `__init__.py` in the sax folder: it imports `StringIO` from `cStringIO`, which cannot deal with Unicode. – Yu Xuan Apr 08 '17 at 13:37
- http://stackoverflow.com/questions/1817695/python-how-to-get-stringio-writelines-to-accept-unicode-string – Niloct Apr 08 '17 at 16:40
1 Answer
1
To parse an "xml" response with BS:
import requests
from bs4 import BeautifulSoup

response = requests.post(url)
soup = BeautifulSoup(response.text, 'xml')
The
response.text
property decodes the response body automatically and returns it as a string. However, if the server does not declare a charset, requests has to guess the encoding before it can decode, and that detection pass is what takes the time. (Presumably pure-ASCII content is recognized almost immediately, which would explain why ASCII responses were decoded fast.)
Set
response.encoding = 'utf-8'
before accessing
response.text
to tell requests how to decode the response content. It will go much faster.
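A minimal sketch of the fix. To keep it runnable offline, a Response object is built by hand here; `_content` is an internal requests attribute, used only to fake the network reply:

```python
import requests

# Fake a response carrying UTF-8 bytes and no charset header
resp = requests.models.Response()
resp._content = u'<status>caf\u00e9</status>'.encode('utf-8')

# Without this line, resp.text would fall back to slow charset detection
resp.encoding = 'utf-8'
print(resp.text)  # -> <status>café</status>
```

With a real `requests.post(url)` response, the only change needed is the single `response.encoding = 'utf-8'` assignment before the first access to `response.text`.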
