I have downloaded the Stack Overflow June 2013 data dump and am now parsing the XML files and storing the data in a MySQL database. I am using Python's ElementTree to do it, and it keeps crashing with encoding errors.
Snippet of the parsing code:
post = open('a.xml', 'r')
a = post.read()
tree = xml.parse((a).encode('ascii', 'ignore'))  # I also tried .encode('utf-8').strip(), but that doesn't work either
# Get all row nodes
row = tree.findall("row")
It gives me the following error:
'ascii' codec can't encode character u'\u2019' in position 248: ordinal not in range(128)
I also tried the following, but the problem persists:
.encode('ascii', 'ignore')
Any advice on fixing the problem would be appreciated. A link to a clean copy of the data would also help.
My final goal is to convert the data into RDF, so if anyone has the Stack Overflow data dump in RDF format, I'd be grateful.
Thanks in advance!
P.S. This is the XML row that causes the problem and crashes the program:
<row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="<blockquote>
 <p>The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. </p>
</blockquote>

<p>I obtained this answer from <a href="http://www.informit.com/guides/content.aspx?g=cplusplus&amp;seqNum=272" rel="nofollow">High Resolution Time Measurement and Timers, Part I</a></p>" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />
Edit: @Arjan, the solution you mentioned here doesn't work for me.
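
For reference, here is a minimal sketch of an approach that might sidestep the manual encoding step altogether: hand ElementTree the file path so the parser reads the raw bytes and applies the encoding declared in the XML itself, then stream the rows with iterparse. The helper name iter_rows and the tiny sample file are hypothetical stand-ins, not part of the dump, and the sample assumes the Body attribute is entity-escaped in the real file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_rows(path):
    """Yield the attributes of each <row> element as a dict, one row at a time."""
    # Passing the path (not an already-read string) lets the parser read raw
    # bytes and decode them using the encoding declared in the XML file.
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy the attributes before clearing
            elem.clear()  # release the element's data once it has been used


# Tiny stand-in for the dump (hypothetical content), written as UTF-8 bytes.
sample = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<posts>\n'
    u'  <row Id="99" Body="&lt;p&gt;the system\u2019s timer&lt;/p&gt;" />\n'
    u'</posts>\n'
)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.xml")
with open(sample_path, "wb") as f:
    f.write(sample.encode("utf-8"))

rows = list(iter_rows(sample_path))
```

In this sketch the attribute values come back as Unicode strings with the U+2019 apostrophe intact, so nothing is encoded to ASCII along the way. Storing them would presumably still require the MySQL connection and table to use a UTF-8 character set such as utf8mb4.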