2

I'm using beautiful soup to parse a HTML document in a Python object but I've run into a small problem.

I'm trying to convert a table into a list of dictionaries. I want the keys in the dictionary to be the column headers however the table has multiple header rows with varying numbers of th elements. In order for the dictionary keys to be valid, I'd need to do somehow merge the two header rows into concatenated versions of themselves.

This is what the header rows look like. Source HTML

This is the underlying HTML

<thead>
   <tr>
      <th></th>
      <th class="metadata platform"></th>
      <th class="wtt time borderleft" colspan="2"><abbr title="Working Timetable">WTT</abbr></th>
      <th class="gbtt time borderleft" colspan="2"><abbr title="Public Timetable (Great Britain Timetable)">GBTT</abbr></th>
      <th class="metadata line path borderleft" colspan="2">Route</th>
      <th class="metadata allowances borderleft" colspan="3">Allowances</th>
   </tr>
   <tr>
      <th>Location</th>
      <th class="metadata platform span2">Pl</th>
      <th class="wtt time span3 borderleft">Arr</th>
      <th class="wtt time span3">Dep</th>
      <th class="gbtt time span3 borderleft">Arr</th>
      <th class="gbtt time span3">Dep</th>
      <th class="metadata line span2 borderleft">Line</th>
      <th class="metadata path span2">Path</th>
      <th class="metadata allowances engineering span2 borderleft"><abbr title="Engineering allowance">Eng</abbr></th>
      <th class="metadata allowances pathing span2"><abbr title="Pathing allowance">Pth</abbr></th>
      <th class="metadata allowances performance span2"><abbr title="Performance allowance">Prf</abbr></th>
   </tr>
</thead>

This is ideally the output I need so I can then do some dictionary comprehension to build the list.

['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr', 
 'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng', 
 'Allowances Pth', 'Allowances Prf']

The only way that I can think it do this is to loop through each th element and build the headers that way. So here, I'd end up with a list of 11 elements that would take two 'passes' to build.

# First pass
['', '', 'WTT', 'WTT', 'GBTT', 
 'GBTT', 'Route', 'Route', 'Allowances ', 
 'Allowances', 'Prf']

# Second pass
['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr', 
 'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng', 
 'Allowances Pth', 'Allowances Prf']

While this is a solution that works, I'd like to think there's a more pythonic way of doing it.

EDIT: Code for creating the dictionary keys:

from bs4 import BeautifulSoup
import requests

url = 'http://www.realtimetrains.co.uk/train/P16871/2018/12/10/advanced'

bs = BeautifulSoup(requests.get(url).content, 'lxml')
table = bs.find_all('table', class_='advanced')
headers = table[0].select('thead tr ')

keys = []
for th in headers[0].findChildren('th'):
    keys.append(th.getText())
    try:
        colspan = int(th['colspan'])
        if colspan > 0:
            for i in range(0, colspan-1):
                keys.append(th.getText())
    except KeyError:
        pass

th_elements = list(headers[1].findChildren('th'))
for i in range(0, len(keys)):
    keys[i] = keys[i] + ' ' + th_elements[i].getText()
    keys[i] = keys[i].strip()

print(keys)
brennan
  • 3,392
  • 24
  • 42
Jake
  • 1,701
  • 3
  • 23
  • 44
  • 1
    Can you [`edit`](https://stackoverflow.com/posts/53566815/edit) and include your code attempt into the question? But even that, with 2 passes, I can't think of a more efficient, or pythonic, way to do it. The way the page data is structured you have to do some kind of string analysis, which requires 1 pass to get the data into some format, and the 2nd pass to analyze and organize the data into that final list. So not sure how much real improvement can be made. – chickity china chinese chicken Dec 01 '18 at 01:50
  • @davedwards, sorry for the delay. Code added. – Jake Dec 01 '18 at 02:35
  • Thanks, and well done. That's way better and shorter than my attempt. If you can also include the headers too so it's a [mcve] perhaps someone more skilled than us can provide a better solution. – chickity china chinese chicken Dec 01 '18 at 02:49
  • 1
    Done. MCVE added. – Jake Dec 01 '18 at 02:52

1 Answers1

1

As an alternative approach, you could use pandas read_html (which also uses BeautifulSoup). Read the html into a dataframe, flatten the column names and output the result to a list of dicts.

import pandas as pd

df = pd.read_html('http://www.realtimetrains.co.uk/train/P16871/2018/12/10/advanced')[0]
df.columns = [' '.join([c for c in col if 'Unnamed' not in c]) 
              for col in df.columns.values]
df.to_dict(orient='records')

Giving:

[
  {
    'Location': 'Swansea [SWA]',
    'Pl': 3.0,
    'WTT Arr': nan,
    'GBTT Dep': 911.0,
    'Route Arr': nan,
    'Allowances Dep': 910.0,
    'Line': nan,
    'Path': nan,
    'Eng': nan,
    'Pth': nan,
    'Prf': nan
  }, 
  ...
]
brennan
  • 3,392
  • 24
  • 42