I'm trying to select a tbody element nested inside the element with id all_totals (it's definitely there; I checked).

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/players/a/abdelal01.html'
data = requests.get(url)
html = BeautifulSoup(data.text, 'html.parser')

print(html.select('#all_totals tbody'))

However, this Beautiful Soup code just returns an empty list. I thought the problem might be caused by the desired element sitting under a giant HTML comment, so I added some code to strip the comments out of the parsed HTML:

for comment in html.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
print(html.select('#all_totals')[0].prettify())

This got rid of the comment; however, most (but not all) of the HTML nested within the all_totals element disappeared as well.

What am I doing wrong, and how can I correctly select the html that I want?

Maaz
16jacobj

2 Answers

You don't want to use extract here: it removes the comments, and the comments are what contain the HTML of interest. Instead, read the table out of the comment itself, for example:

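from io import StringIO

import pandas as pd

# Scan every HTML comment and parse the one that holds the totals table
for comment in html.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="totals"' in comment:
        # Wrap in StringIO: newer pandas versions deprecate passing literal HTML
        table = pd.read_html(StringIO(comment))[0]
        print(table)
        break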
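
If you want the tbody as a Beautiful Soup element rather than a DataFrame, the same idea can be sketched without pandas: re-parse the comment text as HTML and select from that. This is only a rough sketch. `SAMPLE_HTML` is made-up markup that imitates the commented-out layout, and `find_commented_tbody` is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup imitating a table wrapped in an HTML comment (assumption)
SAMPLE_HTML = """
<div id="all_totals">
  <div class="placeholder"></div>
  <!--
  <table id="totals">
    <tbody>
      <tr><td>1990-91</td><td>POR</td></tr>
    </tbody>
  </table>
  -->
</div>
"""


def find_commented_tbody(page_html, wrapper_id, table_id):
    """Return the tbody of a table hidden inside a comment, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    wrapper = soup.find(id=wrapper_id)
    if wrapper is None:
        return None
    # Look only at comment nodes that sit inside the wrapper element
    for comment in wrapper.find_all(string=lambda s: isinstance(s, Comment)):
        # The comment body is plain text, so parse it as its own document
        inner = BeautifulSoup(comment, "html.parser")
        tbody = inner.select_one(f"table#{table_id} tbody")
        if tbody is not None:
            return tbody
    return None


tbody = find_commented_tbody(SAMPLE_HTML, "all_totals", "totals")
print(tbody.prettify())
```

On the real page you would pass `data.text` from the question instead of `SAMPLE_HTML`, assuming the table there is wrapped in a comment the same way.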
QHarr
  • thanks for the response, but the HTML I wanted is below a comment, not inside a comment; sorry if that was poorly explained in my post – 16jacobj Aug 16 '19 at 21:26

You can use Selenium to find the tbody directly, because it is loaded by JavaScript.

Try this:

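from bs4 import BeautifulSoup, Comment
from selenium import webdriver

url = 'https://www.basketball-reference.com/players/a/abdelal01.html'
driver = webdriver.Firefox()
driver.get(url)
# Parse the page source as the browser sees it after running JavaScript
html = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the source has been captured

print(html.find('div', {'id':'all_totals'}).find('tbody').prettify())

# Optionally strip the remaining comments before printing the whole div
for comment in html.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
print(html.find('div', {'id': 'all_totals'}).prettify())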
Maaz
  • thanks so much, this worked! Whenever I run this code, it opens a Firefox window; is there any way around that? – 16jacobj Aug 16 '19 at 21:28
  • found a solution to stop Firefox opening a window, for anyone curious: https://stackoverflow.com/a/46768243/8780895 – 16jacobj Aug 16 '19 at 23:42