2

I want to find the value of the <td> that "belongs" to a given <th>. I can search for the text in the <th> tag and find it, but I do not know the <td> value, and there is no class to search for. The number of columns can vary as well, so all I have to go on is the text in the <th>.

Example of a table:

<table>
    <tbody>
        <tr>
            <th colspan="8">
                <span>
                    <a href="/link">Table Title</a>
                </span>
            </th>
        </tr>
        <tr>
            <th>Info1</th>
            <th>Info2</th>
            <th>Info3</th>
            <th>Info4</th>
            <th>Info5</th>
        </tr>
        <tr>
            <td>Value1</td>
            <td>Value2</td>
            <td>Value3</td>
            <td>Value4</td>
            <td>Value5</td>
        </tr>
    </tbody>
</table>

Let's say I want to find Value4, which "belongs" to Info4. How can I do this in BeautifulSoup?

Python 3.7.4 and BeautifulSoup 4.9.3.

2by

3 Answers

2

You could use pandas to parse the table and grab that column:

html = '''
<table>
    <tbody>
        <tr>
            <th colspan="8">
                <span>
                    <a href="/link">Table Title</a>
                </span>
            </th>
        </tr>
        <tr>
            <th>Info1</th>
            <th>Info2</th>
            <th>Info3</th>
            <th>Info4</th>
            <th>Info5</th>
        </tr>
        <tr>
            <td>Value1</td>
            <td>Value2</td>
            <td>Value3</td>
            <td>Value4</td>
            <td>Value5</td>
        </tr>
    </tbody>
</table>'''

Pandas:

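from io import StringIO

import pandas as pd

# header=1 uses the second row (Info1..Info5) as the column names, skipping the title row.
# StringIO avoids the deprecation warning newer pandas versions raise for literal HTML strings.
df = pd.read_html(StringIO(html), header=1)[0]
item4 = list(df['Info4'])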

Output:

print(item4)
['Value4']

Adding to Akasha's answer: that loop can be collapsed into a single line by calling .index() on a list of the header texts.

idx = [x.text for x in tr.find_all('th')].index('Info4')

would be the same as:

for i, th in enumerate(tr.find_all('th')):
    if th.text == 'Info4':
        idx = i
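
The two forms differ in one detail: .index() returns the first match and raises ValueError when the header is missing, while the loop keeps the last match and leaves idx unset. A minimal sketch of a guarded lookup, assuming the sample table from the question; the helper name header_index is hypothetical and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr><th colspan="8"><span><a href="/link">Table Title</a></span></th></tr>
<tr><th>Info1</th><th>Info2</th><th>Info3</th><th>Info4</th><th>Info5</th></tr>
<tr><td>Value1</td><td>Value2</td><td>Value3</td><td>Value4</td><td>Value5</td></tr>
</tbody></table>
"""


def header_index(tr, label):
    """Return the position of the <th> whose text equals label, or None."""
    headers = [th.get_text(strip=True) for th in tr.find_all('th')]
    try:
        return headers.index(label)
    except ValueError:
        # The header is not present in this row.
        return None


soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')

# rows[1] is the header row, rows[2] the value row (as in the sample table).
idx = header_index(rows[1], 'Info4')
value = rows[2].find_all('td')[idx].get_text(strip=True) if idx is not None else None
print(value)
```

Returning None instead of raising is only one possible choice; letting the ValueError propagate may be preferable when a missing header should stop processing.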
chitown88
  • Awesome, thanks! Any idea which method is faster in regards to performance? I need to do it on a lot of data. :) – 2by Jan 14 '21 at 17:20
  • Well, pandas can use Beautiful Soup under the hood (read_html parses with either lxml or bs4), so I'm not sure there's much of a difference. I don't know exactly how pandas implements it. My guess is that it'll either be about the same or pandas will be faster, but honestly it's just a guess. – chitown88 Jan 14 '21 at 18:03
1
# Header row. Instead of a fixed position you can search for Info4 and take its parent tr.
tr = soup.find_all('tr')[1]

idx = None
for i, th in enumerate(tr.find_all('th')):
    if th.text == 'Info4':
        idx = i
        break  # stop at the first matching header

This index can be used to access the value which belongs to the chosen header.

tr = soup.find_all('tr')[2]
value = tr.find_all('td')[idx].text  # .text gives 'Value4' rather than the <td> tag
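
The comment in the first snippet suggests searching for Info4 and taking its parent tr instead of relying on fixed row positions. A minimal sketch of that variant, assuming the sample table from the question, that the header text matches exactly (no surrounding whitespace), and that neither the header row nor the value row uses colspan:

```python
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr><th colspan="8"><span><a href="/link">Table Title</a></span></th></tr>
<tr><th>Info1</th><th>Info2</th><th>Info3</th><th>Info4</th><th>Info5</th></tr>
<tr><td>Value1</td><td>Value2</td><td>Value3</td><td>Value4</td><td>Value5</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the header cell by its exact text.
th = soup.find('th', string='Info4')

# Its column position is the number of <th> siblings that precede it.
idx = len(th.find_previous_siblings('th'))

# The values are assumed to sit in the row directly after the header row.
header_row = th.find_parent('tr')
value_row = header_row.find_next_sibling('tr')
value = value_row.find_all('td')[idx].get_text(strip=True)
print(value)
```

Because the position is derived from the header cell itself, this keeps working when the number of columns varies or when the table has extra rows above the header row.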
0

You said you can already find the <th> text (Info1, Info2, ...).

Based on that, I wrote some very simple code.

You can (and should) improve it, but it already works as long as you know which "Info" header you are looking for.

index.html is your HTML sample.

The idea is to record the position of the "Info" header (for example, I want the value under Info2), then walk through the <td> cells and pick the one at the same position (position 10 for Info10 is position 10 for Value10).

from bs4 import BeautifulSoup

# index.html contains the HTML sample from the question
with open('index.html') as file:
    soup = BeautifulSoup(file, 'html.parser')

rows = soup.find_all('tr')

target = 'Info3'  # header text to look up
pos = None

# The second row holds the column headers: record the position of the target header
for i, th in enumerate(rows[1].find_all('th')):
    if th.get_text(strip=True) == target:
        pos = i
        break

# The third row holds the values: print the cell at the same position
if pos is not None:
    print(rows[2].find_all('td')[pos].get_text(strip=True))
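
The position mapping described above can also be built once for every column, which may help when several values are needed from the same table. A minimal sketch, assuming the sample table from the question with headers in the second row and values in the third; the name by_header is hypothetical:

```python
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr><th colspan="8"><span><a href="/link">Table Title</a></span></th></tr>
<tr><th>Info1</th><th>Info2</th><th>Info3</th><th>Info4</th><th>Info5</th></tr>
<tr><td>Value1</td><td>Value2</td><td>Value3</td><td>Value4</td><td>Value5</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')

# Pair each header with the cell at the same position in the value row.
headers = [th.get_text(strip=True) for th in rows[1].find_all('th')]
values = [td.get_text(strip=True) for td in rows[2].find_all('td')]
by_header = dict(zip(headers, values))

print(by_header['Info3'])
```

zip stops at the shorter of the two lists, so a row with fewer cells than headers drops the trailing headers instead of raising an error.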
Lucas lima