
I'm trying to scrape multiple pages of a football website. All the links are in the list teamLinks. An example of one of the links is: 'http://www.premierleague.com//clubs/1/Arsenal/squad?se=79'. I was wondering whether it is possible to make the requests call wait until the page has fully updated before it runs. If you click on the link, it initially displays the 2018/2019 squad and then refreshes to the 2017/2018 squad, which is the one I want.

import requests
from lxml import html

playerLink1 = []
playerLink2 = []

for i in range(len(teamLinks)):

    # Request the squad page.
    squadPage = requests.get(teamLinks[i])
    squadTree = html.fromstring(squadPage.content)

    # Extract the player links.
    playerLocation = squadTree.cssselect('.playerOverviewCard')

    # For each player link within the team page.
    for j in range(len(playerLocation)):

        # Save the link, complete with domain.
        playerLink1.append("http://www.premierleague.com/" +
                           playerLocation[j].attrib['href'] + '?se=79')
        # For the second link, change the page from player overview to stats.
        playerLink2.append(playerLink1[-1].replace("overview", "stats"))
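For reference, the overview-to-stats substitution on the most recently appended link can be sketched offline like this (the sample href value is made up for illustration):

```python
# Minimal sketch of the link construction above; the href value is illustrative.
base = "http://www.premierleague.com/"
href = "/players/4985/Petr-Cech/overview"

overview_link = base + href + "?se=79"
stats_link = overview_link.replace("overview", "stats")

print(overview_link)  # → http://www.premierleague.com//players/4985/Petr-Cech/overview?se=79
print(stats_link)     # → http://www.premierleague.com//players/4985/Petr-Cech/stats?se=79
```

Note that replacing on the link just appended (rather than indexing with the outer loop counter) is what keeps the overview and stats links paired correctly.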
Nico Sánchez

2 Answers


The page you are trying to scrape uses JavaScript to load the player list you want.

Option 1: You can use the newer requests-html module (I have never tried it myself), which claims to support rendering JavaScript.

Option 2: Using Chrome's DevTools, I found the actual XHR request the page makes to fetch the player list. The code below gets your required output with the requests module alone.

import json
import requests

playerLink1 = []
playerLink2 = []

# Headers copied from the browser's XHR request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
           'Origin': 'https://www.premierleague.com',
           'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
           'Referer': 'https://www.premierleague.com//clubs/1/Arsenal/squad?se=79'}

res = requests.get('https://footballapi.pulselive.com/football/teams/1/compseasons/79/staff?altIds=true&compCodeForActivePlayer=EN_PR', headers=headers)

player_data = json.loads(res.content.decode('utf-8'))

for player in player_data['players']:
    # href is already a full URL, so no extra domain prefix or '?se=79' is needed.
    href = 'https://www.premierleague.com/players/{}/{}/'.format(player['id'], player['name']['display'].replace(' ', '-'))
    playerLink1.append(href + "overview")
    playerLink2.append(href + "stats")
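To illustrate the link construction offline, here is the same loop run against a mocked slice of the API's JSON. The field names (`players`, `id`, `name.display`) match those used above; the player values are made up. Note that `href` is built as a full URL, so prepending the domain again would double it:

```python
import json

# Mocked response body mimicking the pulselive API shape; values are illustrative.
sample = json.loads('{"players": [{"id": 4985, "name": {"display": "Petr Cech"}}]}')

playerLink1, playerLink2 = [], []
for player in sample['players']:
    # Build the canonical player URL from the id and the hyphenated display name.
    href = 'https://www.premierleague.com/players/{}/{}/'.format(
        player['id'], player['name']['display'].replace(' ', '-'))
    playerLink1.append(href + "overview")
    playerLink2.append(href + "stats")

print(playerLink1[0])  # → https://www.premierleague.com/players/4985/Petr-Cech/overview
print(playerLink2[0])  # → https://www.premierleague.com/players/4985/Petr-Cech/stats
```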
Kamal
  • Yes this worked perfectly however without the addition of "http://www.premierleague.com/" and '?se=79' to the url – Nico Sánchez Mar 18 '19 at 12:49
  • Actually, I saw such addition in your code in question. Anyways, if you have found solution to your problem then please select an answer to close the question. – Kamal Mar 19 '19 at 01:52

I have found one solution. You can use the Selenium WebDriver in headless mode, give the page some time with time.sleep(), and then get the page_source from the driver. I have checked the data and it shows up as expected.

However, I don't know your full URL list, so create your own list and try it. Let me know if you need further help.

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

teamlinks = ['http://www.premierleague.com//clubs/1/Arsenal/squad?se=79',
             'http://www.premierleague.com//clubs/1/Arsenal/squad?se=54']
playerLink1 = []
playerLink2 = []

for i in range(len(teamlinks)):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('window-size=1920x1080')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(teamlinks[i])
    # Give the page time to refresh to the requested season's squad.
    time.sleep(10)
    squadPage = driver.page_source
    soup = BeautifulSoup(squadPage, 'html.parser')
    playerLocation = soup.findAll('a', class_=re.compile("playerOverviewCard"))
    for j in range(len(playerLocation)):

        # Save the link, complete with domain.
        playerLink1.append("http://www.premierleague.com/" +
                           playerLocation[j]['href'] + '?se=79')
        # For the second link, change the page from player overview to stats.
        playerLink2.append(playerLink1[-1].replace("overview", "stats"))
    driver.quit()
print(playerLink2)
KunduK