re-formatting HTML using python 2.7 and BS4

Question

I have a piece of software that exports race results as a htm file. I currently open this file in excel and reformat the results tables into a format suitable to be uploaded into a mysql databse, where lots of number crunching takes place using php.

I want to automate the re-format/parse/scrape, and searches on the web keep suggesting python 2.7 with beuatifulsoup4 which I started to play around with, but I have not used either before...

This is a look at the original table structure

<h1>CLASS 1</h1>
    <table class="results">
        <tr><th nowrap="1">Name</th><th nowrap="1">Town</th><th nowrap="1">Bike</th><th nowrap="1">Penalty</th></tr>
        <tr><td class="rider">RIDER 1</td><td></td><td></td><td>01:14:20</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>00:26:36</td><td>00:19:51</td><td>00:27:54</td></tr></table></td></tr>
        <tr><td class="rider">RIDER 2</td><td></td><td></td><td>00:41:06</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>00:19:10</td><td>00:21:57</td></tr></table></td></tr>
        <tr><td class="rider">RIDER 3</td><td></td><td></td><td>00:36:59</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>00:37:00</td></tr></table></td></tr>
        <tr><td class="rider">RIDER 4</td><td></td><td></td><td>01:26:41</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>01:26:42</td></tr></table></td></tr>
    </table>
<h1>CLASS 2</h1>

I would like the export to be one table for each class with all the info for eash rider on a single row, like this...

NAME1 02:26:4 200:12:42 00:13:04 00:13:25 00:13:19 00:13:22 00:13:29 00:13:44
NAME2: 02:41:06 00:13:17 00:14:10 00:13:40 00:13:38 00:13:47 00:13:12 00:13:24

Playing around in python I got so far as to reading file using beautifulsoup.

from bs4 import BeautifulSoup

with open(r'test.htm', "r") as f:
    pagebuffer = f.read()  
soup = BeautifulSoup(pagebuffer, "lxml")

After inspecting the html I was able to search the soup for the relevant class names.

riders = soup.find_all(class_="rider")
for item in riders:
    print item.text

NAME 1
NAME 2
NAME 3
NAME 4
NAME 5
NAME 6
NAME 7
NAME 8
NAME 9
NAME 10
NAME 11
NAME 12
NAME 13
NAME 14
NAME 15
NAME 16
NAME 17
NAME 18
NAME 19

laps = soup.find_all(class_="laps")
for item in laps:
    print item.text

00:12:4200:13:0400:13:2500:13:1900:13:2200:13:2900:13:4400:13:3000:13:2000:13:3800:13:10
00:12:2600:13:1700:14:1000:13:4000:13:3800:13:4700:13:1200:13:2400:13:2500:13:4700:13:43
00:12:3100:13:1300:13:2200:13:5200:13:5500:14:0800:13:2500:13:4500:13:5300:13:4400:13:25
00:14:2300:14:2600:15:0100:14:5300:14:5800:14:3100:14:4400:15:3300:14:1900:14:14
00:13:5700:13:4800:14:1900:14:3200:14:5100:15:0300:14:3600:17:5700:14:4200:14:39
00:14:1100:14:3200:14:4300:14:2300:14:5900:14:4600:15:1000:15:0500:15:1400:16:13
00:13:4100:13:3200:14:0000:14:0100:14:3200:14:1000:14:3600:14:2100:28:5500:14:17
00:13:3000:13:3900:14:00

02:36:4900:13:2800:13:3700:13:5600:13:5700:14:4600:14:1700:14:2700:15:1800:14:3800:14:1100:14:15
02:36:5800:13:5900:13:4900:14:1900:14:1100:14:2300:14:2700:14:2700:14:2400:14:2600:14:1300:14:21
02:27:0100:14:2300:14:2600:15:0100:14:5300:14:5800:14:3100:14:4400:15:3300:14:1900:14:14
02:28:2300:13:5700:13:4800:14:1900:14:3200:14:5100:15:0300:14:3600:17:5700:14:4200:14:39
02:29:1500:14:1100:14:3200:14:4300:14:2300:14:5900:14:4600:15:1000:15:0500:15:1400:16:13
02:36:0400:13:4100:13:3200:14:0000:14:0100:14:3200:14:1000:14:3600:14:2100:28:5500:14:17
00:41:0800:13:3000:13:3900:14:00

This is where I'm stuck...
1. How do I combine those two searches (riders, laps)together?
2. The total time is not defined by a class name, how can I search to the 3rd [td] tag after the class rider [td] tag?
3. Will I be able to make this executable on machines that dont have python and bs4 installed or should i look at other coding methods?

here is a link to a typical htm file: http://www.kr3w.co.uk/downloads/test.htm

alecxe · Answer 1 · 2016-04-06T13:19:51.850

For each rider, you would need to get the next row using find_next_sibling().

Complete implementation:

from pprint import pprint

from bs4 import BeautifulSoup

data = """
<table class="results">
    <tr><th nowrap="1">Name</th><th nowrap="1">Town</th><th nowrap="1">Bike</th><th nowrap="1">Penalty</th></tr>
    <tr><td class="rider">RIDER 1</td><td></td><td></td><td>01:14:20</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>00:26:36</td><td>00:19:51</td><td>00:27:54</td></tr></table></td></tr>
    <tr><td class="rider">RIDER 2</td><td></td><td></td><td>00:41:06</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>00:19:10</td><td>00:21:57</td></tr></table></td></tr>
    <tr><td class="rider">RIDER 3</td><td></td><td></td><td>00:36:59</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>00:37:00</td></tr></table></td></tr>
    <tr><td class="rider">RIDER 4</td><td></td><td></td><td>01:26:41</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>01:26:42</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")
data = []
for rider in soup.select("td.rider"):  # all td elements having `rider` class
    rider_name = rider.get_text()
    # getting the last td element in this row - total time
    total_time = rider.parent.find_all("td")[-1].get_text()

    # getting laps from the next row from the current
    laps = [td.get_text() for td in rider.parent.find_next_sibling('tr').select("table.laps tr td")]

    data.append([rider_name, total_time] + laps)

pprint(data)

Prints:

[['RIDER 1', '01:14:20', '00:26:36', '00:19:51', '00:27:54'],
 ['RIDER 2', '00:41:06', '00:19:10', '00:21:57'],
 ['RIDER 3', '00:36:59', '00:37:00'],
 ['RIDER 4', '01:26:41', '01:26:42']]

And, since you have multiple tables in your real HTML, you need process it table by table. Working example:

from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = "http://www.kr3w.co.uk/downloads/test.htm"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36"}).content
soup = BeautifulSoup(response, "html.parser")


for table in soup.select("table.results"):
    title = table.find_previous_sibling("h1").get_text()

    data = []
    for rider in table.select("td.rider"):  # all td elements having `rider` class
        rider_name = rider.get_text()
        # getting the last td element in this row - total time
        total_time = rider.parent.find_all("td")[-1].get_text()

        # getting laps from the next row from the current
        laps = [td.get_text() for td in rider.parent.find_next_sibling('tr').select("table.laps tr td")]

        data.append([rider_name, total_time] + laps)

    print(title)
    pprint(data)
    print("-----")

Thank you for the suggestion, that has moved me a step closer to the solution. I have a couple of questions though. When I run your code example it works just as i wanted, but my results are returned as unicode and your results look like a string. Is there a reason for this, or is it anything I need to be concerned about? When I replace the html sample section with my open file command the results are all in a single column. Would this indicate the html is broken? — SG129, Apr 06 '16 at 10:03
@SG129 don't worry about the `u` - it just means a [unicode string](http://stackoverflow.com/questions/2464959/whats-the-u-prefix-in-a-python-string). The reasons I don't get that is that I'm on Python 3. As for your last question, you need to loop over all of the tables and process them separately - I've added a sample to the answer. Hope that helps. — alecxe, Apr 06 '16 at 13:21
I can see that's very nearly working now, but as I scroll through the code output only a small number are actually displayed correctly in a single row, all of the others are in a column underneath the name. — SG129, Apr 07 '16 at 15:41

re-formatting HTML using python 2.7 and BS4

1 Answers1