
I'm parsing an HTML table using BS4 in Python. Everything works fine and I'm able to identify all the elements that I need and print them, but the program stops working when I try to write the results into a text file. I get this error:

"UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 13: ordinal not in range(128)"

I have tried using .encode('utf-8') in the write call, but then the values get written with stray characters after them, e.g. 31.61 followed by junk.
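
For illustration, a minimal Python 2 sketch of both behaviours, assuming a cell value that ends in a non-breaking space (the 31.61 value here is just a made-up example): writing the unicode string directly triggers Python's implicit ASCII encode and fails, while .encode('utf-8') succeeds but writes the two-byte UTF-8 sequence for the non-breaking space, which a viewer that assumes a single-byte encoding shows as stray characters.

# Minimal Python 2 sketch of the behaviour described above (hypothetical value)
value = u'31.61\xa0'                  # cell text ending in a non-breaking space

with open('demo.txt', 'w') as f:
    # f.write(value)                  # raises UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0'
    f.write(value.encode('utf-8'))    # succeeds, but writes the bytes '31.61\xc2\xa0'

print repr(u'\xa0'.encode('utf-8'))   # '\xc2\xa0' -- shown as stray characters by single-byte viewers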

Here's what I'm running. I used the same code structure to parse another table and it worked. I'd appreciate it if anyone could point me in the right direction.

from threading import Thread
import urllib2
import re
from bs4 import BeautifulSoup


url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan" 
myfile = open('base/basei/' + url[57:].replace("%20", " ").replace("%27","'") + '.txt','w+')
soup = BeautifulSoup(urllib2.urlopen(url).read())  
for tr in soup.find_all('tr')[0:]:
  tds = tr.find_all('td')
  if len(tds) >=0:
    print tds[0].text, ",", tds[4].text, ",", tds[7].text, ",", tds[12].text, ",", tds[14].text, ",", tds[17].text
    myfile.write(tds[0].text + ','+ tds[4].text + "," + tds[7].text + "," + tds[12].text + "," + tds[14].text + "," + tds[17].text)

myfile.close() 
  • I've tried your code on both Windows 7 and Ubuntu and can't get the malformed character to appear; I just get a space after anything that's like 31. Maybe you should add more details of how you're running the code (OS, how you're running your script and how you're viewing the text file). Otherwise, you could try one of the suggestions in http://stackoverflow.com/questions/10993612/python-removing-xa0-from-string to just clobber the data into submission. – Steven Maude Apr 18 '14 at 17:31
  • I'm running Python in the Mac OS Snow Leopard terminal. – user3319895 Apr 18 '14 at 18:32
  • The problem is that the cells I'm parsing (e.g. 31.14) contain a &nbsp;, which leaves a space after the text; that's what is messing with the write and causing the error. I don't know how to bypass or remove this &nbsp; in the tag. – user3319895 Apr 18 '14 at 19:34

1 Answer


The code below works for me. I replaced the non-breaking space with a comma; this way you can use the output directly as a CSV (e.g. you can easily read it into Excel or LibreOffice Calc).

import urllib2
from bs4 import BeautifulSoup

url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan"
soup = BeautifulSoup(urllib2.urlopen(url).read())

with open('out.txt', 'w') as myfile:
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        # Skip any row that doesn't have all the cells we want to pull out.
        if len(tds) >= 18:
            # Grab the six columns of interest and strip surrounding whitespace.
            stripped_tds = [tds[x].text.strip() for x in (0, 4, 7, 12, 14, 17)]
            out = ','.join(stripped_tds)
            # Replace the non-breaking space left inside some cells with a comma.
            out = out.replace(u'\xa0', ',')
            print out
            myfile.write(out + '\n')

(The with statement removes the need to call myfile.close() explicitly: the file is closed automatically when the block finishes, even if an exception is raised inside it.)
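
For comparison, the with block above is roughly equivalent to the explicit try/finally below; this is only a sketch of the file-handling part, not the full loop.

myfile = open('out.txt', 'w')
try:
    myfile.write('...\n')   # the loop from the answer would go here
finally:
    myfile.close()          # runs even if an exception was raised in the try block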

Content of out.txt:

2014-04-15,E5,31.28,7,6,32.18,C
2014-04-13,E6,31.07,2,4,31.64,B
2014-04-11,E6,31.21,6,6,32.53,B
2014-04-07,E7,30.93,5,7,32.31,B
2014-04-03,S1,30.82,3,2,31.23,
2014-03-30,E9,31.02,3,8,31.97,A
2014-03-28,E9,30.95,7,8,31.85,A
2014-03-23,E9,30.88,8,8,32.06,A
2014-03-21,E6,30.83,1,1,30.83,SB
2014-03-17,E5,31.14,1,1,31.14,C
2014-03-15,E5,31.00,4,4,31.62,C
2014-03-10,E3,31.46,4,1,31.46,D
2014-03-08,A3,31.79,4,5,32.23,D
2014-03-03,A6,31.20,3,5,31.81,D
2014-03-01,E3,31.61,3,3,31.88,D