1

Here's what my HTML looks like:

<head> ... </head>

<body>
    <div>
        <h2>Something really cool here<h2>
        <div class="section mylist">
            <table id="list_1" class="table">
                <thead> ... not important <thead>
                <tr id="blahblah1"> <td> ... </td> </tr> 
                <tr id="blah2"> <td> ... </td> </tr> 
                <tr id="bl3"> <td> ... </td> </tr> 
            </table>
        </div>
    </div>
</body>

There are four occurrences of this div in my HTML file. Each table's content and each h2's text is different; everything else is more or less the same. So far I've been able to extract the parent of each h2. However, I'm not sure how to extract each tr from there so that I can then pull out the td that I really need.

Here is the code I've written so far...

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'html.parser')

currently_watching = soup.find('h2', text='Something really cool here')
parent = currently_watching.parent
user1883614

2 Answers

2

I would suggest finding the div that directly encloses the table, and then searching it for all td tags. Here's how you'd do it:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'lxml')

div = soup.find('div', class_='section mylist')    
for td in div.find_all('td'):
    print(td.text)
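
Since the question mentions four such divs, each under its own h2, the same idea can be extended to group the cells by heading. This is only a sketch: the inline markup, the headings and the cell values below are made-up stand-ins for myhtml.html, and it assumes each h2 appears before its "section mylist" div, as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for myhtml.html, with two of the repeated blocks.
html = """
<body>
  <div>
    <h2>First heading</h2>
    <div class="section mylist">
      <table id="list_1" class="table">
        <tr id="a1"><td>alpha</td></tr>
        <tr id="a2"><td>beta</td></tr>
      </table>
    </div>
  </div>
  <div>
    <h2>Second heading</h2>
    <div class="section mylist">
      <table id="list_2" class="table">
        <tr id="b1"><td>gamma</td></tr>
      </table>
    </div>
  </div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

cells_by_heading = {}
# class_='mylist' matches any div whose class list contains "mylist".
for section in soup.find_all("div", class_="mylist"):
    # Walk backwards in document order to the closest preceding h2.
    heading = section.find_previous("h2").get_text(strip=True)
    # Visit every row, then every cell inside that row.
    cells_by_heading[heading] = [
        td.get_text(strip=True)
        for tr in section.find_all("tr")
        for td in tr.find_all("td")
    ]

print(cells_by_heading)
```

Keying the result on the h2 text makes it easy to pick out one table later, for example cells_by_heading['First heading'].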
cs95
  • For some reason, it gives me a `TypeError: 'NoneType' object is not callable` in the first line of that for loop – user1883614 Aug 23 '17 at 07:20
  • @user1883614 It should be `find_all`. Sorry. – cs95 Aug 23 '17 at 07:27
  • It's not printing anything - I double checked what I've provided above and it matches the code that i have... – user1883614 Aug 23 '17 at 07:36
  • @user1883614 That is disappointing. Try `print(td.string)` and then `print(td.contents)`, whichever works. One of them should. Let me know which works. – cs95 Aug 23 '17 at 07:43
  • So I tried doing `print len(div.find_all('tr'))` which gave me a 0 and I've been looking online for some other code examples but I can't seem to figure out why it won't read it. – user1883614 Aug 23 '17 at 07:44
  • @user1883614 Weird. I tried it with your input and it works for me. Try perhaps, `open('myhtml.html').read()` instead? Also, you should be looking for `td`, not `tr`. – cs95 Aug 23 '17 at 07:46
  • Not sure if it matters but the div code is inside a `body` tag which is inside `html`. Updated the code from above – user1883614 Aug 23 '17 at 07:49
  • It makes no difference, because it worked for me. Sorry! – cs95 Aug 23 '17 at 07:50
0

Searched around a bit and realized that it was my parser that was causing the issue. I installed lxml and everything works fine now.
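
For anyone hitting the same thing, a quick way to check whether the parser is the culprit is to run the same markup through each parser and compare how many rows come back. This is a rough sketch, not a diagnosis of my exact file: count_rows and the sample markup are names I made up for illustration, and parsers that aren't installed are reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <tr> tags found}.

    A parser that is not installed maps to None instead of a count.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts


# Hypothetical well-formed sample; substitute the contents of your own file.
sample = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
print(count_rows(sample))
```

If the counts differ between parsers on your real file, the markup is probably malformed somewhere and the parsers are repairing it differently.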

Why is BeautifulSoup not finding a specific table class?

user1883614