
I just started learning Python (3.8) and am building a scraper to get some football stats. Here's the code so far.

I originally wanted to pull a div with id='div_alphabet', which is clearly in the HTML tree on the website, but for some reason bs4 wasn't finding it. I investigated further and noticed that when I pull in the parent div 'all_alphabet' and then look for all of its child divs, 'div_alphabet' is missing. The only odd thing about the HTML structure is the long block comment that sits right above 'div_alphabet'. Could that be the cause?

https://www.pro-football-reference.com/players

import requests
from bs4 import BeautifulSoup

URL = 'https://www.pro-football-reference.com/'
homepage = requests.get(URL)
home_soup = BeautifulSoup(homepage.content, 'html.parser')

players_nav_URL = home_soup.find(id='header_players').a['href']

players_directory_page = requests.get(URL + players_nav_URL)
players_directory_soup = BeautifulSoup(players_directory_page.content, 'html.parser')

alphabet_nav = players_directory_soup.find(id='all_alphabet')
all_letters = alphabet_nav.find_all('div')
print(all_letters)
Python Learner
  • What data do you actually need? Looking at the HTML, the div with id 'all_alphabet' looks like a container without any useful information. There is, however, a ul tag with class="page_index" that has lots of data in it. Clarifying what data you need would help get you where you need to be. – AaronS Jul 16 '20 at 15:36
  • @AaronS I'd like to get the hrefs of the links to all the alphabetized player names. Within div_alphabet, there's a list of links, one per letter. I know they're just the letters of the alphabet, and it would be far easier to be explicit about the request, but this is just for practice and I was wondering why bs4 wasn't pulling in this last div. – Python Learner Jul 16 '20 at 15:50
  • Check out this answer for collecting tags inside commented-out HTML: https://stackoverflow.com/questions/52679150/beautifulsoup-extract-text-from-comment-html – Murtaza Haji Jul 16 '20 at 15:52
  • @MurtazaHaji Thanks for replying - my problem is that I'm trying to pull in a div that comes right after commented HTML. – Python Learner Jul 16 '20 at 16:03

2 Answers

links = [a['href'] for a in players_directory_soup.select('ul.page_index li div a')]
names = [a.get_text() for a in players_directory_soup.select('ul.page_index li div a')]

This gives you two lists: the relative links for the alphabetised players, and the matching names.

I wouldn't concern yourself with div_alphabet; it doesn't have any useful information.

Here we select every a tag nested under the ul tag with class "page_index". select() returns a list, so we loop over it and grab the href attribute of each tag. get_text() gives you the names.

If you haven't come across list comprehensions, plain for loops would also be acceptable.

links = []
names = []
# One pass over the matched <a> tags fills both lists.
for a in players_directory_soup.select('ul.page_index li div a'):
    links.append(a['href'])
    names.append(a.get_text())
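
To see what the selector matches without hitting the site, here is a minimal sketch that runs it against a made-up HTML snippet. SAMPLE_HTML and the sample_* names are assumptions for illustration only; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the shape the selector expects.
# It is NOT copied from the real site.
SAMPLE_HTML = """
<ul class="page_index">
  <li><div><a href="/players/A/">A</a></div></li>
  <li><div><a href="/players/B/">B</a></div></li>
</ul>
<ul class="other">
  <li><div><a href="/ignored/">X</a></div></li>
</ul>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Only <a> tags under <ul class="page_index"> ... <li> ... <div> match.
anchors = sample_soup.select('ul.page_index li div a')

sample_links = [a['href'] for a in anchors]
sample_names = [a.get_text() for a in anchors]

print(sample_links)
print(sample_names)
```

The link under the second ul is skipped because that list does not carry the page_index class.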
AaronS
  • Thank you for following up, and I understand this gets to the same answer, but do you know why my original code was not pulling in the div_alphabet div? – Python Learner Jul 16 '20 at 16:03
  • It was, but there is no information within div_alphabet, except for a comment which soup doesn't parse. – AaronS Jul 16 '20 at 16:03
  • Ahh okay, thank you. Familiar with list comprehension so I will use this as a starting base. Appreciate your time investment! – Python Learner Jul 16 '20 at 16:07
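
The comment-related behaviour discussed in this thread can be reproduced offline. The sketch below assumes (this is not confirmed for the real page) that the wanted div sits inside an HTML comment in the raw response. SAMPLE_HTML is invented for illustration. It shows that find() does not see markup inside a comment, and that the comment text can be re-parsed to reach it, which is the approach the linked answer describes.

```python
from bs4 import BeautifulSoup, Comment

# Invented markup: the interesting div exists only inside a comment.
SAMPLE_HTML = """
<div id="all_alphabet">
  <div class="placeholder"></div>
  <!--
  <div id="div_alphabet"><a href="/players/A/">A</a></div>
  -->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The parser stores the comment as a single Comment string, so the div
# inside it is not part of the tag tree.
found_directly = soup.find(id='div_alphabet')
print(found_directly)

# Collect every Comment node, then parse each comment's text as HTML.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
hidden_links = []
for comment in comments:
    inner_soup = BeautifulSoup(comment, 'html.parser')
    hidden_div = inner_soup.find(id='div_alphabet')
    if hidden_div is not None:
        hidden_links.extend(a['href'] for a in hidden_div.find_all('a'))

print(hidden_links)
```

If the real response is structured this way, a browser's element inspector could still show the div as live markup when a script on the page uncomments it after loading; requests only ever sees the raw response.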

Something like this code will do it:

import requests
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 '}
r = requests.get('https://www.pro-football-reference.com/players/', headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
data = soup.select('ul.page_index li div')
for link in data:
    # Print each href on its own line.
    for a in link.select('a'):
        print(a.get('href'))

A more useful approach is to build a pandas DataFrame from the results and save it as a CSV file:

import requests
from bs4 import BeautifulSoup
import pandas as pd

players = []

headers = {'User-Agent': 'Mozilla/5.0 '}
r = requests.get('https://www.pro-football-reference.com/players/', headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
data = soup.select('ul.page_index li div a')
for link in data:
    players.append([link.get_text(strip=True), 'https://www.pro-football-reference.com' + link.get('href')])
print(players[0])
df = pd.DataFrame(players, columns=['Player name', 'Url'])
print(df.head())
df.to_csv('players.csv', index=False)
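
A side note on building the absolute URLs: plain string concatenation works when the base has no trailing slash, as above, but yields a double slash when it does (the question's URL constant ends in '/'). A minimal sketch using the standard library's urljoin, with a hypothetical relative_href standing in for a scraped href value:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.pro-football-reference.com/'
relative_href = '/players/A/'  # hypothetical value, for illustration only

# Concatenation keeps both slashes.
naive = BASE_URL + relative_href

# urljoin resolves the href against the base the way a browser would.
joined = urljoin(BASE_URL, relative_href)

print(naive)   # https://www.pro-football-reference.com//players/A/
print(joined)  # https://www.pro-football-reference.com/players/A/
```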
UWTD TV