I'm using beautiful soup to parse a HTML document in a Python object but I've run into a small problem.
I'm trying to convert a table into a list of dictionaries. I want the keys in the dictionary to be the column headers however the table has multiple header rows with varying numbers of th elements. In order for the dictionary keys to be valid, I'd need to do somehow merge the two header rows into concatenated versions of themselves.
This is what the header rows look like.
This is the underlying HTML
<thead>
<tr>
<th></th>
<th class="metadata platform"></th>
<th class="wtt time borderleft" colspan="2"><abbr title="Working Timetable">WTT</abbr></th>
<th class="gbtt time borderleft" colspan="2"><abbr title="Public Timetable (Great Britain Timetable)">GBTT</abbr></th>
<th class="metadata line path borderleft" colspan="2">Route</th>
<th class="metadata allowances borderleft" colspan="3">Allowances</th>
</tr>
<tr>
<th>Location</th>
<th class="metadata platform span2">Pl</th>
<th class="wtt time span3 borderleft">Arr</th>
<th class="wtt time span3">Dep</th>
<th class="gbtt time span3 borderleft">Arr</th>
<th class="gbtt time span3">Dep</th>
<th class="metadata line span2 borderleft">Line</th>
<th class="metadata path span2">Path</th>
<th class="metadata allowances engineering span2 borderleft"><abbr title="Engineering allowance">Eng</abbr></th>
<th class="metadata allowances pathing span2"><abbr title="Pathing allowance">Pth</abbr></th>
<th class="metadata allowances performance span2"><abbr title="Performance allowance">Prf</abbr></th>
</tr>
</thead>
This is ideally the output I need so I can then do some dictionary comprehension to build the list.
['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr',
'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng',
'Allowances Pth', 'Allowances Prf']
The only way that I can think it do this is to loop through each th element and build the headers that way. So here, I'd end up with a list of 11 elements that would take two 'passes' to build.
# First pass
['', '', 'WTT', 'WTT', 'GBTT',
'GBTT', 'Route', 'Route', 'Allowances ',
'Allowances', 'Prf']
# Second pass
['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr',
'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng',
'Allowances Pth', 'Allowances Prf']
While this is a solution that works, I'd like to think there's a more pythonic way of doing it.
EDIT: Code for creating the dictionary keys:
from bs4 import BeautifulSoup
import requests
url = 'http://www.realtimetrains.co.uk/train/P16871/2018/12/10/advanced'
bs = BeautifulSoup(requests.get(url).content, 'lxml')
table = bs.find_all('table', class_='advanced')
headers = table[0].select('thead tr ')
keys = []
for th in headers[0].findChildren('th'):
keys.append(th.getText())
try:
colspan = int(th['colspan'])
if colspan > 0:
for i in range(0, colspan-1):
keys.append(th.getText())
except KeyError:
pass
th_elements = list(headers[1].findChildren('th'))
for i in range(0, len(keys)):
keys[i] = keys[i] + ' ' + th_elements[i].getText()
keys[i] = keys[i].strip()
print(keys)