
I need to scrape the full table from this site, which has a "Load more" option.

As of now, when I scrape, I only get the rows that show up by default when the page loads.

import pandas as pd
import requests
from six.moves import urllib

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
header = {'Accept-Language': "en-US,en;q=0.9",
          'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
          }

resp2 = requests.get(url=URL2, headers=header).text

tables2 = pd.read_html(resp2)
overview_table2= tables2[0]
overview_table2
    Player Name            Team             Matches  Goals  Time Played  Unnamed: 5
0   Jorge Pereyra Diaz     Mumbai City            9      6     538 Mins         NaN
1   Cleiton Silva          SC East Bengal         8      5     707 Mins         NaN
2   Abdenasser El Khayati  Chennaiyin FC          5      4     231 Mins         NaN
3   Lallianzuala Chhangte  Mumbai City            9      4     737 Mins         NaN
4   Nandhakumar Sekar      Odisha                 8      4     673 Mins         NaN
5   Ivan Kalyuzhnyi        Kerala Blasters        7      4     428 Mins         NaN
6   Bipin Singh            Mumbai City            9      4     806 Mins         NaN
7   Noah Sadaoui           Goa                    8      4     489 Mins         NaN
8   Diego Mauricio         Odisha                 8      3     526 Mins         NaN
9   Pedro Martin           Odisha                 8      3     263 Mins         NaN
10  Dimitri Petratos       ATK Mohun Bagan        6      3     517 Mins         NaN
11  Petar Sliskovic        Chennaiyin FC          8      3     662 Mins         NaN
12  Holicharan Narzary     Hyderabad              9      3     705 Mins         NaN
13  Dimitrios Diamantakos  Kerala Blasters        7      3     529 Mins         NaN
14  Alberto Noguera        Mumbai City            9      3     371 Mins         NaN
15  Jerry Mawihmingthanga  Odisha                 8      3     611 Mins         NaN
16  Hugo Boumous           ATK Mohun Bagan        7      2     580 Mins         NaN
17  Javi Hernandez         Bengaluru              6      2     397 Mins         NaN
18  Borja Herrera          Hyderabad              9      2     314 Mins         NaN
19  Mohammad Yasir         Hyderabad              9      2     777 Mins         NaN
20  Load More....          Load More....   Load More....  Load More....  Load More....  Load More....

But I need the full table, including the data under "Load more". Please help.


2 Answers

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    # Query parameters taken from the XHR request that the "Load More"
    # button fires (visible in the browser dev tools, Network -> XHR).
    params = {
        "action": "stats",
        "league_id": "750",      # matches the l750 part of the page URL
        "limit": "300",          # rows per request; 300 covers the full table
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview"
    }
    r = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    # For each player link, take its title attribute (the player name) plus
    # the next four <td> cells: team, matches, goals and time played.
    goal = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    df = pd.DataFrame(
        goal, columns=['Name', 'Team', 'Matches', 'Goals', 'Time Played'])
    print(df)


main('https://www.mykhel.com/src/index.php')

Output:

                      Name              Team Matches Goals Time Played
0       Jorge Pereyra Diaz       Mumbai City       9     6    538 Mins
1            Cleiton Silva    SC East Bengal       8     5    707 Mins
2    Abdenasser El Khayati     Chennaiyin FC       5     4    231 Mins
3    Lallianzuala Chhangte       Mumbai City       9     4    737 Mins
4        Nandhakumar Sekar            Odisha       8     4    673 Mins
..                     ...               ...     ...   ...         ...
268          Sarthak Golui    SC East Bengal       6     0    402 Mins
269          Ivan Gonzalez    SC East Bengal       8     0    683 Mins
270       Michael Jakobsen  NorthEast United       8     0    676 Mins
271       Pratik Chowdhary     Jamshedpur FC       6     0    495 Mins
272         Chungnunga Lal    SC East Bengal       8     0    720 Mins

[273 rows x 5 columns]
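
Side note: the other stat tables on the page (e.g. "Passes") appear to be served by the same endpoint via the tab parameter, and the call above already returns a DataFrame, so you can reuse it. The tab value "passes" below is an assumption, not something confirmed by the site; check the XHR request that fires when you open that tab in dev tools for the real value. A sketch:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def fetch_tab(tab):
    # Same endpoint and parameters as main() above; only "tab" changes.
    params = {
        "action": "stats",
        "league_id": "750",
        "limit": "300",
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": tab,
    }
    r = requests.get('https://www.mykhel.com/src/index.php',
                     headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    rows = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    # Column names differ per tab, so none are hardcoded here.
    return pd.DataFrame(rows)


df_passes = fetch_tab('passes')  # hypothetical tab value -- verify in dev tools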
  • Can you please explain why you refer to the other link, an **index.php**, and where you get those `params` from? Is it a sort of API? Is it documented for this website? – DiMithras Dec 03 '22 at 19:20
  • And what is `'lxml'` for? Removing it does not change anything. – DiMithras Dec 03 '22 at 19:30
  • @DiMithras check [that](https://stackoverflow.com/a/60925669/7658985); regarding `lxml`, it's just the parser type, and the quickest and most efficient one. Removing it lets the soup fall back to `html.parser` by default. – αԋɱҽԃ αмєяιcαη Dec 03 '22 at 20:40
  • @αԋɱҽԃαмєяιcαη ohh, cool! That explains a lot! I thought about dev tools and looked there, but missed the click-the-button step. Now I can clearly see that link and the params; it would also make sense to set the `Fetch/XHR` filter to find that record more easily. Happy to learn something new today, much appreciated. Yet [tag:Selenium] remains an option, doesn't it? – DiMithras Dec 03 '22 at 20:59
  • @DiMithras if you think that Selenium is designed for web scraping, you'll have to think twice about what you learned. Selenium was created for test-case purposes only. – αԋɱҽԃ αмєяιcαη Dec 03 '22 at 21:00
  • @αԋɱҽԃαмєяιcαη also for whatever reason I don't see `Params` tab in **Firefox**, but it can be found in **Chrome** under `Payload`. – DiMithras Dec 03 '22 at 21:06
  • @αԋɱҽԃαмєяιcαη yeah, I know it's a testing suite. As you say, "you don't need it for such a single page". But for something bigger? – DiMithras Dec 03 '22 at 21:08
  • @DiMithras https://i.imgur.com/DZ1F1bv.png isn't that Firefox? And BTW, it's not about small/bigger! Selenium isn't for *web scraping* at all – αԋɱҽԃ αмєяιcαη Dec 03 '22 at 21:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/250122/discussion-between-dimithras-and--c). – DiMithras Dec 03 '22 at 21:18
  • Thanks mate, but I'm still not able to understand how I'll be able to scrape the data in other tables like "Passes" on the site. Also, is the result of the main(url) function a dataframe? – Footilytics - Indian football Dec 05 '22 at 04:47

This is a dynamically loaded page, so you cannot parse all the contents without hitting a button.
Well… maybe you can with XHR or something like that (the other answer here does exactly that).

I'll stick to working with dynamically loaded pages using the Selenium browser automation suite.

Installation

To get started, you'll need to install the Selenium Python bindings:

pip install selenium

You seem to already have BeautifulSoup4, but for anyone who might come across this answer: we'll also need it, together with html5lib, to parse the table later:

pip install html5lib BeautifulSoup4

Now, for Selenium to work you'll need a driver installed for a browser of your choice. To get the drivers you may use Selenium Manager, driver management software, or download the drivers manually. The first two options are something new; I have had my manually downloaded drivers for ages, so I'll stick to them. The download links:

  • geckodriver (Firefox): https://github.com/mozilla/geckodriver/releases
  • chromedriver (Chrome and other Chromium-based browsers): https://chromedriver.chromium.org/downloads

You can use any browser, e.g. Brave Browser, Yandex Browser, basically any Chromium-based browser of your choice, or even Tor Browser.
Anyway, that's a bit out of this answer's scope; just keep in mind that for any browser and its family you'll need the matching driver.

I'll stick with Firefox, hence you need Firefox installed and the driver placed somewhere. The best option would be to add that folder to your PATH variable.

If you choose Chromium, you'll have to match the chromedriver version strictly to your Chrome version. As for Firefox, I have a pretty old geckodriver 0.29.1 and it works like a charm with the latest Firefox.

Hands on

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"

driver = webdriver.Firefox()
driver.get(URL2)

# Keep clicking "Load More...." until the link disappears
element = driver.find_element(By.XPATH, "//a[text()=' Load More.... ']")
while element.is_displayed():
    driver.execute_script("arguments[0].click();", element)

table = driver.find_element(By.CSS_SELECTOR, 'table')
tables2 = pd.read_html(table.get_attribute('outerHTML'))
driver.close()

overview_table2 = tables2[0].dropna(how='all').dropna(axis='columns', how='all')
overview_table2 = overview_table2.drop_duplicates().reset_index(drop=True)
overview_table2
  1. We only need pandas for our resulting table and selenium for web automation.
  2. URL2 — is the same variable you used.
  3. driver = webdriver.Firefox() — here we instantiate Firefox and the browser will get opened. This is where the magic will happen.
    Note: If you decided to skip adding the driver to your PATH variable, you can reference the driver location directly here, e.g.:
    • webdriver.Firefox(service=Service(executable_path=r"C:\WebDriver\bin\geckodriver.exe"))
    • webdriver.Chrome(service=Service(executable_path="/path/to/chromedriver"))
    (importing the matching Service class from selenium.webdriver.firefox.service or selenium.webdriver.chrome.service)
  4. driver.get(URL2) — open the desired page.
  5. element = driver.find_element(By.XPATH, "//a[text()=' Load More.... ']")
    Using an XPath selector we find the link that has the same text as your 20th row.
  6. With that stored element we keep clicking it until it disappears.
    It would be more sensible and easy to just use element.click(), but that results in an error; more info in another Stack Overflow question. A more defensive variant of this loop is sketched after this list.
  7. Assign the table variable with the corresponding element.
  8. tables2: I left this weird variable name as it is in your question.
    Here we get outerHTML, as innerHTML would render the contents of the <table> tag, but not the tag itself.
  9. We should not forget to .close() our driver, as we don't need it anymore.
  10. As a result of the HTML parsing there will be a list of DataFrames, just like in your question. I drop the unnamed empty column and the last empty row, then deduplicate and reset the index.
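
On step 6: a slightly more defensive variant of the click loop re-finds the button on every pass with an explicit wait, so a stale reference is never clicked. A minimal sketch under the same setup as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 10)
while True:
    try:
        # Re-locate the link on each pass so we never hold a stale reference
        button = wait.until(EC.element_to_be_clickable(
            (By.XPATH, "//a[text()=' Load More.... ']")))
        driver.execute_script("arguments[0].click();", button)
    except TimeoutException:
        break  # the link is gone: all rows are loaded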

The resulting overview_table2 looks like:

     Player Name            Team              Matches  Goals  Time Played
0    Jorge Pereyra Diaz     Mumbai City           9.0    6.0     538 Mins
1    Cleiton Silva          SC East Bengal        8.0    5.0     707 Mins
2    Abdenasser El Khayati  Chennaiyin FC         5.0    4.0     231 Mins
..   ...                    ...                   ...    ...          ...
270  Michael Jakobsen       NorthEast United      8.0    0.0     676 Mins
271  Pratik Chowdhary       Jamshedpur FC         6.0    0.0     495 Mins
272  Chungnunga Lal         SC East Bengal        8.0    0.0     720 Mins

Side note

Job done. As a further improvement you may play with different browsers and try headless mode: a mode where the browser window does not open in your desktop environment, but runs silently in the background.
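
A minimal sketch of headless Firefox, assuming the same geckodriver setup as above (only the driver construction changes, the rest of the code stays the same):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run Firefox without opening a window

driver = webdriver.Firefox(options=options)
# ...same clicking and parsing code as above...
driver.close()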
