Beautiful Soup won't extract specific HTML

Question

I am using Windows 7 with PyCharm and Notepad++. With the Python code below I have managed to extract the data required from the following HTML:

    <div class="resultsBlock">
        <ul class="header">
            <li class="first essential fin">Fin</li>
            <li class="essential greyhound">Greyhound</li>
            <li class="trap">Trap</li>
            <li class="sp">SP</li>
            <li class="timeSec">Time/Sec.</li>
            <li class="timeDistance">Time/Distance</li>
        </ul>

The code extracts the finishing position, name, trap, SP, timeSec and timeDistance, and writes the information to a CSV file. Just above this block in the page source is the following HTML:

  <div class="resultsBlockHeader clearfix mediumRoundedCorners">
    <div class="track">Belle Vue&nbsp;|&nbsp;</div>
    <div class="date">23/08/15</div>
    <div class="datetime">13:51&nbsp;|&nbsp;</div>
    <div class="grade">A7&nbsp;|&nbsp;</div>
    <div class="distance">470m&nbsp;|&nbsp;</div>
    <div class="prizes">1st £56, Others £20 (BGRF added £30)</div>

So effectively from my Python code I replace this:

one = bsObj.findAll("li", {"class": "first essential fin"})

with this

track = bsObj.findAll("div", {"class": "track"})

However, when I do this Python ignores it and doesn't give me any message telling me why. Below is the code in full; near the beginning I have placed one line that attempts to extract the div with class "track". Any suggestions appreciated.

import csv
from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsRace.aspx?id=1793467" )
bsObj = BeautifulSoup(html)

track = bsObj.findAll("div", {"class": "track"})  # the line that won't work

one = bsObj.findAll("li", {"class": "first essential fin"})
two = bsObj.findAll("li", {"class": "essential greyhound"})

four = bsObj.findAll("li", {"class": "timeDistance"})
five = bsObj.findAll("li", {"class": "trap"})
six = bsObj.findAll("li", {"class": "sp"})
seven = bsObj.findAll("li", {"class": "timeSec"})
eight = bsObj.findAll("li", {"class": "essential trainer"})
nine = bsObj.findAll("li", {"class": "first essential comment"})
ten = bsObj.findAll("div", {"class": "track"})
firstessentialfin = [ a.getText().strip() for a in one ]
essentialgreyhound = [ b.getText().strip() for b in two]
timeDistance = [ c.getText().strip() for c in four]
trap = [ d.getText().strip() for d in five ]
sp = [ e.getText().strip() for e in six ]
timeSec = [ f.getText().strip() for f in seven]
essentialtrainer = [ g.getText().strip() for g in eight]
firstessentialcomment = [ h.getText().strip() for h in nine]
track = [ i.getText().strip() for i in ten]
with open('lugs.csv', 'wb') as csvfile:
writer = csv.writer(csvfile, delimiter=",")
for f in zip(firstessentialfin, essentialgreyhound, trap, timeSec,  timeDistance,sp,track):
    writer.writerow(f)

Could it be possible because you have nested divs? So you could find in the first div level for "track" class and there is none... — Richard, Aug 26 '15 at 14:56
When you say it won't work, what do you mean? You never did a `getText()` for anything in the `track` variable. Also the indentation is incorrect in the last 4 lines but that might just be a copy paste error and not a problem with your code. — dstudeba, Aug 26 '15 at 17:55
"It won't do it, guys! I'm doing everything right and this library just won't do what I tell it!" Unlikely. BeautifulSoup is a well-tested popular library. You're obviously doing something wrong. Please post at least an example of your input, your expected output, and what you have tried in a [minimum, complete, and verifiable form](https://stackoverflow.com/help/mcve). — Two-Bit Alchemist, Aug 26 '15 at 18:21
Hi both, thanks for the answers. Richard, I'm not too sure what you mean by nested divs; I thought BS would just find anything it was pointed at. dstudeba, I have now included all the code, including the getText(), and added it to the zip at the bottom. I get the following message: ...line 37, in writer.writerow(f) UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 9: ordinal not in range(128)... I have done a search on this and it seems to have something to do with the text style in the HTML, but the HTML is the same throughout the page. — looknow, Aug 26 '15 at 18:27
Hi Two-Bit Alchemist, the code does work if you exclude the line track = bsObj.findAll("div", {"class": "track"}) and all its associated code. This is the confusing bit, because I can't see what is preventing BS from sorting this out when I am applying the same principles to identical problems... well, I think I am anyway :) :) :) — looknow, Aug 26 '15 at 18:38

score 0 · Answer 1 · edited May 23 '17 at 12:14

Please see this answer: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

Then try changing

track = [ i.getText().strip() for i in ten]

to

track = [ i.getText().encode('utf-8').strip() for i in ten]

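I believe your problem is that the text contains a Unicode character that str can't handle: each &nbsp; in the HTML becomes the non-breaking space u'\xa0', and Python 2 cannot encode that with the default ASCII codec when the csv writer converts the row to str.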
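As a rough illustration (a sketch under assumptions, not your actual data flow): the string below is what I would expect `getText()` to return for the track div, with each `&nbsp;` turned into `u'\xa0'`. The helper names `ascii_failure_position` and `clean_track` are made up for this example. After `.strip()` the first non-ASCII character sits at index 9, which lines up with the "position 9" in the error message quoted in the comments.

```python
# Assumed text for <div class="track">Belle Vue&nbsp;|&nbsp;</div>:
# each &nbsp; entity becomes the non-breaking space U+00A0.
raw_track = u"Belle Vue\xa0|\xa0"

# strip() removes U+00A0 only at the ends of a unicode string,
# so the one between "Vue" and "|" is still there afterwards.
stripped = raw_track.strip()


def ascii_failure_position(text):
    """Return the index of the first character ASCII cannot encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err.start
    return None


def clean_track(text):
    """Hypothetical helper: replace non-breaking spaces and drop the trailing '|'."""
    return text.replace(u"\xa0", u" ").strip().rstrip(u"|").strip()


print(ascii_failure_position(stripped))               # 9
print(clean_track(raw_track))                         # Belle Vue
print(ascii_failure_position(clean_track(raw_track))) # None

# Encoding explicitly as UTF-8 (the fix suggested above) never raises here.
utf8_bytes = stripped.encode("utf-8")
```

Either approach should get the row past `writer.writerow`: encoding to UTF-8 keeps the non-breaking space in the file, while cleaning the text first leaves plain ASCII. On Python 3 the csv module writes text, so the usual route there is to open the file with `open('lugs.csv', 'w', newline='', encoding='utf-8')` and skip the `.encode('utf-8')` call.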

answered Aug 28 '15 at 05:10

dstudeba
