I'm trying to extract the Most Common Batting Orders from http://www.baseball-reference.com/teams/SFG/2017-batting-orders.shtml

import bs4
import urllib.request as urllib

url = 'http://www.baseball-reference.com/teams/SFG/2017-batting-orders.shtml'
html = urllib.urlopen(url).read() 
batting_order_soup = bs4.BeautifulSoup(html, "html.parser")
table = batting_order_soup.find("table", attrs={"class":"stats_table nav_table"})

>>> print(table)
None

I would expect to see a table with the columns 6 Games, 4 Games, 4 Games, 3 Games, 2 Games, and under the 6 Games column the batters Span, Nunez, Belt, etc.

In the browser, I see the 6 Games table both inside an HTML comment and in the regular HTML, e.g.

<table class="stats_table nav_table" id="st_0"><tbody><tr class="rowSum">
<td valign="top"><strong>6 Games</strong><p></p><li value="1">
 <a data-entry-id="spande01" href="/players/s/spande01.shtml" 
title="Denard Span">Span</a> </li>
<li value="2"><a data-entry-id="nunezed02" href="/players/n/nunezed02.shtml"
title="Eduardo Nunez">Nunez</a></li>

Is there a way within BeautifulSoup to extract the table? The output of print(batting_order_soup) contains no-js, so perhaps, as noted in the comments below, the JavaScript isn't being run. I presume we can't get bs4 to run JS? Can someone provide an example of how to extract the table embedded in the comments?

The code can be run interactively. For example, if you run

table = batting_order_soup.find("table")
print(table)

you will get the first table on the page, which is Batting Order.

Thank you, -Raj

Raj
  • Can you try something like `attrs={'class':['stats_table', 'nav_table']}` – Damien Oct 05 '18 at 06:24
  • See how to create a [mcve]. – Peter Wood Oct 05 '18 at 06:28
  • that table is inside a comment in your particular case, that's why you're not able to find it – fernandezcuesta Oct 05 '18 at 06:31
  • @fernandezcuesta it appears twice, once in a comment, once outside. – Peter Wood Oct 05 '18 at 08:38
  • Can you reduce the value of html down to a minimum example? I wonder whether the page uses JavaScript. Looking at the page source there isn't `stats_table nav_table`, but in the browser there is. I think it's being post processed in JavaScript. You might need to use something like [tag:selenium] – Peter Wood Oct 05 '18 at 08:45
  • @PeterWood checking the contents of `html` and `batting_order_soup`, there is a `stats_table nav_table` class table, but only inside a comment. Same as checking the page source in Firefox (only one occurrence, commented) – fernandezcuesta Oct 05 '18 at 10:59
  • @fernandezcuesta the page source is different from the DOM. If you inspect the page using the developer tools, there is post-processing of the tables, for example to allow sorting. – Peter Wood Oct 05 '18 at 11:03
  • ok, I added a bit more information - hopefully meets the minimal,complete,verifiable example. – Raj Oct 06 '18 at 07:01
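
To illustrate the point discussed in the comments, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the real page source (an assumption, not a copy of it). It suggests that neither way of writing the class filter can find a table that only exists inside an HTML comment, because the parser keeps the whole comment as a single text node.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the page source: one regular table and one
# table that only exists inside an HTML comment.
SAMPLE_HTML = """
<div id="visible"><table class="other_table"><tr><td>Batting Order</td></tr></table></div>
<!--
<table class="stats_table nav_table" id="st_0"><tr><td><strong>6 Games</strong></td></tr></table>
-->
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact class string, as in the question.
exact_match = soup.find("table", attrs={"class": "stats_table nav_table"})
# List of classes, as suggested in the first comment.
list_match = soup.find("table", attrs={"class": ["stats_table", "nav_table"]})

# The markup is still there, but only as the text of a Comment node.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
in_comment = [c for c in comments if "stats_table nav_table" in c]

print(exact_match, list_match, len(in_comment))
```

Both lookups return None here, while the comment search finds one match, which is consistent with what the question reports for the real page.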

1 Answer

The issue here is that the table you're interested in sits inside an HTML comment. The data shows up as a real table when the page is loaded in a browser, but when you fetch it with Python, i.e. without running JavaScript and such, it's only a comment.

So the easy way to get the data, IMHO, is to extract all the comments (take a look at this answer), pick the right one, create a new BeautifulSoup object from it, and then parse that.

Working code for that solution would look like this:

import requests
from bs4 import BeautifulSoup, Comment
from pprint import pprint

r = requests.get("http://www.baseball-reference.com/teams/SFG/2017-batting-orders.shtml")
soup = BeautifulSoup(r.text, "html.parser")
comments = soup.find_all(string=lambda text:isinstance(text,Comment))

# the comment we need contains the string 'stats_table nav_table'
for comment in comments:
    if 'stats_table nav_table' in comment:
        table_soup = BeautifulSoup(comment, "html.parser")

table = table_soup.find('table')
tds = table.find_all('td')
return_dict = {}

for td in tds:
    header = td.find('strong').get_text()
    batter_list = td.find_all('li')
    batter_list = [batter.get_text() for batter in batter_list]
    return_dict[header] = batter_list

pprint(return_dict)
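
For trying the same technique without hitting the network, the steps can be wrapped in a function and run against a small inline sample. This is only a sketch: the function name `batting_orders_from_comments`, the `marker` parameter, and the sample HTML (modelled on the fragment in the question) are assumptions for illustration. It also skips comments and cells that lack the expected tags instead of raising an error.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample modelled on the fragment shown in the question.
SAMPLE_HTML = """
<!--
<table class="stats_table nav_table" id="st_0"><tbody><tr class="rowSum">
<td valign="top"><strong>6 Games</strong><p></p>
<li value="1"><a href="/players/s/spande01.shtml" title="Denard Span">Span</a></li>
<li value="2"><a href="/players/n/nunezed02.shtml" title="Eduardo Nunez">Nunez</a></li>
</td>
</tr></tbody></table>
-->
"""


def batting_orders_from_comments(html, marker="stats_table nav_table"):
    """Return {column header: [batters]} from the first matching commented table."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if marker not in comment:
            continue
        # Re-parse the comment text so its markup becomes real tags.
        table = BeautifulSoup(comment, "html.parser").find("table")
        if table is None:
            continue
        orders = {}
        for td in table.find_all("td"):
            header = td.find("strong")
            if header is None:
                continue  # skip cells without a "<n> Games" heading
            orders[header.get_text(strip=True)] = [
                li.get_text(strip=True) for li in td.find_all("li")
            ]
        return orders
    return {}  # no comment contained the marker


orders = batting_orders_from_comments(SAMPLE_HTML)
print(orders)
```

To use it on the live page, pass `r.text` from the request above in place of `SAMPLE_HTML`.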
jlaur
  • Thank you, jlaur! The code works - really odd that the website would take this approach when other stats on their pages come back in a straightforward manner. – Raj Oct 08 '18 at 21:47