web page: https://fbref.com/en/comps/9/gca/Premier-League-Stats

I have scraped the top table and I'm now attempting to scrape the second.

import requests
from bs4 import BeautifulSoup

URL = 'https://fbref.com/en/comps/9/gca/Premier-League-Stats'
page = requests.get(URL)


soup = BeautifulSoup(page.content, 'html.parser')


stepa= soup.find(id="all_stats_gca")

The above works fine, but I can't get any further. I would have thought the next step would be:

stepb=stepa.find("div",{"class":"table_outer_container"})

but printing this returns None. Any other suggestions?

  • After quickly checking the source code of that page, I didn't see any div with a class named `table_outer_container` inside of the div with the id `all_stats_gca` – revliscano May 11 '20 at 21:49
  • Maybe you want the div with the id `all_stats_gca_squads` – revliscano May 11 '20 at 21:50
  • Pretty sure it's there, though a little way down. And no, I used `all_stats_gca_squads` for the first table I scraped @revliscano – Nenny Dunnazz May 11 '20 at 22:05
  • Oh yes, right. The problem is that the content you're interested in is commented. I checked that they add a class named `commented` to that div. They must be doing that as a way of protecting their data. You can see this by opening the source code (CTRL + U) instead of inspecting the elements in the devtools. – revliscano May 11 '20 at 22:26
  • Yes, I confirmed that they have a function in their js file to show the commented content. Nice protection from them, I must say. Will have it in mind for the future – revliscano May 11 '20 at 22:27
  • Thank you. As you can probably tell, I'm still fairly new to this. Is there a link or anything you can direct me to that might help me understand better why I can't scrape this? Or should reading up more on HTML be enough? @revliscano – Nenny Dunnazz May 11 '20 at 22:36
  • No problem, pal. Check [this answer](https://stackoverflow.com/questions/33138937/how-to-find-all-comments-with-beautiful-soup), it might be helpful. – revliscano May 11 '20 at 22:38
  • I have added an answer with a workaround for your case. I hope it helps. – revliscano May 12 '20 at 20:33

1 Answer

As I said in the comments, the problem is that the page wraps the div with the class table_outer_container in an HTML comment. BeautifulSoup treats a comment as a single text node rather than as markup, so the tags inside it are not searchable, and that is why find() returns None when you call it on stepa.

As a workaround (based on this answer), you can pull the comment out of stepa and parse its contents as HTML to get at that commented div:

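from bs4 import BeautifulSoup, Comment

# Collect every HTML comment nested inside stepa
stepb = stepa.find_all(string=lambda text: isinstance(text, Comment))
# Take the first comment's text and parse it as HTML in its own right
comment_content = stepb[0].extract().replace('\n', ' ').replace('\t', ' ')
new_soup = BeautifulSoup(comment_content, 'html.parser')

table_outer_container = new_soup.find("div",{"class":"table_outer_container"})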
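To see the behaviour without hitting the site, here is a minimal sketch that runs against an inline HTML string. The markup in SAMPLE_HTML, the table contents, and the helper name find_commented_div are all made up for illustration and only imitate the structure described above; the real page may differ. The helper also checks every comment instead of assuming the first one holds the table.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, simplified stand-in for the page's markup: the table is
# wrapped in an HTML comment, so the parser sees it as text, not as tags.
SAMPLE_HTML = """
<div id="all_stats_gca" class="commented">
  <!--
  <div class="table_outer_container">
    <table id="stats_gca">
      <tr><th>Player</th><th>SCA</th></tr>
      <tr><td>Example Player</td><td>10</td></tr>
    </table>
  </div>
  -->
</div>
"""


def find_commented_div(container, class_name):
    """Return the first div with class_name found inside any comment of container."""
    comments = container.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Re-parse the comment's text as HTML so its tags become searchable
        inner = BeautifulSoup(comment, "html.parser")
        match = inner.find("div", {"class": class_name})
        if match is not None:
            return match
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stepa = soup.find(id="all_stats_gca")

# Searching directly finds nothing because the div is commented out
direct = stepa.find("div", {"class": "table_outer_container"})

# Searching inside the comments recovers it
recovered = find_commented_div(stepa, "table_outer_container")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in recovered.find_all("tr")]
print(direct)
print(rows)
```

Looping over all the comments is a small safety margin in case the container holds more than one; if you know the table is always in the first comment, indexing with [0] as above is enough.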
revliscano