
I need to scrape the full table from this site, which has a "Load more" option.

As of now, when I scrape, I only get the rows that show up by default when the page loads.

import pandas as pd
import requests
from six.moves import urllib

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
header = {'Accept-Language': "en-US,en;q=0.9",
          'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
          }

resp2 = requests.get(url=URL2, headers=header).text

tables2 = pd.read_html(resp2)
overview_table2= tables2[0]
overview_table2
    Player Name            Team             Matches  Goals  Time Played  Unnamed: 5
0   Jorge Pereyra Diaz     Mumbai City            9      6     538 Mins         NaN
1   Cleiton Silva          SC East Bengal         8      5     707 Mins         NaN
2   Abdenasser El Khayati  Chennaiyin FC          5      4     231 Mins         NaN
3   Lallianzuala Chhangte  Mumbai City            9      4     737 Mins         NaN
4   Nandhakumar Sekar      Odisha                 8      4     673 Mins         NaN
5   Ivan Kalyuzhnyi        Kerala Blasters        7      4     428 Mins         NaN
6   Bipin Singh            Mumbai City            9      4     806 Mins         NaN
7   Noah Sadaoui           Goa                    8      4     489 Mins         NaN
8   Diego Mauricio         Odisha                 8      3     526 Mins         NaN
9   Pedro Martin           Odisha                 8      3     263 Mins         NaN
10  Dimitri Petratos       ATK Mohun Bagan        6      3     517 Mins         NaN
11  Petar Sliskovic        Chennaiyin FC          8      3     662 Mins         NaN
12  Holicharan Narzary     Hyderabad              9      3     705 Mins         NaN
13  Dimitrios Diamantakos  Kerala Blasters        7      3     529 Mins         NaN
14  Alberto Noguera        Mumbai City            9      3     371 Mins         NaN
15  Jerry Mawihmingthanga  Odisha                 8      3     611 Mins         NaN
16  Hugo Boumous           ATK Mohun Bagan        7      2     580 Mins         NaN
17  Javi Hernandez         Bengaluru              6      2     397 Mins         NaN
18  Borja Herrera          Hyderabad              9      2     314 Mins         NaN
19  Mohammad Yasir         Hyderabad              9      2     777 Mins         NaN
20  Load More....          Load More....   Load More....  Load More....  Load More....  Load More....

But I need the full table, including the data under "Load more". Please help.


2 Answers

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    # Query parameters taken from the XHR request that the "Load More"
    # button fires (visible in the browser dev tools, Network -> XHR).
    params = {
        "action": "stats",
        "league_id": "750",      # matches the l750 part of the page URL
        "limit": "300",          # rows per request; 300 covers the full table
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview"
    }
    r = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    # For each player link, take its title attribute (the player name) plus
    # the next four <td> cells: team, matches, goals and time played.
    goal = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    df = pd.DataFrame(
        goal, columns=['Name', 'Team', 'Matches', 'Goals', 'Time Played'])
    print(df)


main('https://www.mykhel.com/src/index.php')

Output:

                      Name              Team Matches Goals Time Played
0       Jorge Pereyra Diaz       Mumbai City       9     6    538 Mins
1            Cleiton Silva    SC East Bengal       8     5    707 Mins
2    Abdenasser El Khayati     Chennaiyin FC       5     4    231 Mins
3    Lallianzuala Chhangte       Mumbai City       9     4    737 Mins
4        Nandhakumar Sekar            Odisha       8     4    673 Mins
..                     ...               ...     ...   ...         ...
268          Sarthak Golui    SC East Bengal       6     0    402 Mins
269          Ivan Gonzalez    SC East Bengal       8     0    683 Mins
270       Michael Jakobsen  NorthEast United       8     0    676 Mins
271       Pratik Chowdhary     Jamshedpur FC       6     0    495 Mins
272         Chungnunga Lal    SC East Bengal       8     0    720 Mins

[273 rows x 5 columns]
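
Side note: the other stat tables on the page (e.g. "Passes") appear to be served by the same endpoint via the tab parameter, and the call above already returns a DataFrame, so you can reuse it. The tab value "passes" below is an assumption, not something confirmed by the site; check the XHR request that fires when you open that tab in dev tools for the real value. A sketch:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def fetch_tab(tab):
    # Same endpoint and parameters as main() above; only "tab" changes.
    params = {
        "action": "stats",
        "league_id": "750",
        "limit": "300",
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": tab,
    }
    r = requests.get('https://www.mykhel.com/src/index.php',
                     headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    rows = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    # Column names differ per tab, so none are hardcoded here.
    return pd.DataFrame(rows)


df_passes = fetch_tab('passes')  # hypothetical tab value -- verify in dev tools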
  • Can you please explain why you refer to the other link, an **index.php**, and where you get those `params` from? Is it a sort of API? Is it documented for this website? – DiMithras Dec 03 '22 at 19:20
  • And what is `'lxml'` for? Removing it does not change anything. – DiMithras Dec 03 '22 at 19:30
  • @DiMithras check [that](https://stackoverflow.com/a/60925669/7658985); regarding `lxml`, it's just the parser type, and the quickest and most efficient one. Removing it lets the soup fall back to `html.parser` by default. – αԋɱҽԃ αмєяιcαη Dec 03 '22 at 20:40
  • @αԋɱҽԃαмєяιcαη ohh, cool! That explains a lot! I thought about dev tools and looked there, but missed the click-the-button step. Now I can clearly see that link and the params; it would also make sense to set the `Fetch/XHR` filter to find that record more easily. Happy to learn something new today, much appreciated. Yet [tag:Selenium] remains an option, doesn't it? – DiMithras Dec 03 '22 at 20:59
  • @DiMithras if you think that Selenium is designed for web scraping, you'll have to think twice about what you learned. Selenium was created for test-case purposes only. – αԋɱҽԃ αмєяιcαη Dec 03 '22 at 21:00
  • @αԋɱҽԃαмєяιcαη also for whatever reason I don't see `Params` tab in **Firefox**, but it can be found in **Chrome** under `Payload`. – DiMithras Dec 03 '22 at 21:06
  • @αԋɱҽԃαмєяιcαη yeah, I know it's a testing suite. As you say, "you don't need it for such a single page". But for something bigger? – DiMithras Dec 03 '22 at 21:08
  • @DiMithras https://i.imgur.com/DZ1F1bv.png isn't that Firefox? And BTW, it's not about small/bigger! Selenium isn't for *web scraping* at all – αԋɱҽԃ αмєяιcαη Dec 03 '22 at 21:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/250122/discussion-between-dimithras-and--c). – DiMithras Dec 03 '22 at 21:18
  • Thanks mate, but I'm still not able to understand how I'll be able to scrape the data in other tables like "Passes" on the site. Also, is the result of the main(url) function a dataframe? – Footilytics - Indian football Dec 05 '22 at 04:47

This is a dynamically loaded page, so you cannot parse all the contents without hitting a button.
Well… maybe you can with XHR or something like that (the other answer here does exactly that).

I'll stick to working with dynamically loaded pages using the Selenium browser automation suite.

Installation

To get started, you'll need to install the Selenium Python bindings:

pip install selenium

You seem to already have BeautifulSoup4, but for anyone who might come across this answer: we'll also need it, together with html5lib, to parse the table later:

pip install html5lib BeautifulSoup4

Now, for Selenium to work you'll need a driver installed for a browser of your choice. To get the drivers you may use Selenium Manager, driver management software, or download the drivers manually. The first two options are something new; I have had my manually downloaded drivers for ages, so I'll stick to them. The download links:

  • geckodriver (Firefox): https://github.com/mozilla/geckodriver/releases
  • chromedriver (Chrome and other Chromium-based browsers): https://chromedriver.chromium.org/downloads

You can use any browser, e.g. Brave Browser, Yandex Browser, basically any Chromium-based browser of your choice, or even Tor Browser.
Anyway, that's a bit out of this answer's scope; just keep in mind that for any browser and its family you'll need the matching driver.

I'll stick with Firefox, hence you need Firefox installed and the driver placed somewhere. The best option would be to add that folder to your PATH variable.

If you choose Chromium, you'll have to match the chromedriver version strictly to your Chrome version. As for Firefox, I have a pretty old geckodriver 0.29.1 and it works like a charm with the latest Firefox.

Hands on

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"

driver = webdriver.Firefox()
driver.get(URL2)

# Keep clicking "Load More...." until the link disappears
element = driver.find_element(By.XPATH, "//a[text()=' Load More.... ']")
while element.is_displayed():
    driver.execute_script("arguments[0].click();", element)

table = driver.find_element(By.CSS_SELECTOR, 'table')
tables2 = pd.read_html(table.get_attribute('outerHTML'))
driver.close()

overview_table2 = tables2[0].dropna(how='all').dropna(axis='columns', how='all')
overview_table2 = overview_table2.drop_duplicates().reset_index(drop=True)
overview_table2
  1. We only need pandas for our resulting table and selenium for web automation.
  2. URL2 — is the same variable you used.
  3. driver = webdriver.Firefox() — here we instantiate Firefox and the browser will get opened. This is where the magic will happen.
    Note: If you decided to skip adding the driver to your PATH variable, you can reference the driver location directly here, e.g.:
    • webdriver.Firefox(service=Service(executable_path=r"C:\WebDriver\bin\geckodriver.exe"))
    • webdriver.Chrome(service=Service(executable_path="/path/to/chromedriver"))
    (importing the matching Service class from selenium.webdriver.firefox.service or selenium.webdriver.chrome.service)
  4. driver.get(URL2) — open the desired page.
  5. element = driver.find_element(By.XPATH, "//a[text()=' Load More.... ']")
    Using an XPath selector we find the link that has the same text as your 20th row.
  6. With that stored element we keep clicking it until it disappears.
    It would be more sensible and easy to just use element.click(), but that results in an error; more info in another Stack Overflow question. A more defensive variant of this loop is sketched after this list.
  7. Assign the table variable with the corresponding element.
  8. tables2: I left this weird variable name as it is in your question.
    Here we get outerHTML, as innerHTML would render the contents of the <table> tag, but not the tag itself.
  9. We should not forget to .close() our driver, as we don't need it anymore.
  10. As a result of the HTML parsing there will be a list of DataFrames, just like in your question. I drop the unnamed empty column and the last empty row, then deduplicate and reset the index.
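
On step 6: a slightly more defensive variant of the click loop re-finds the button on every pass with an explicit wait, so a stale reference is never clicked. A minimal sketch under the same setup as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 10)
while True:
    try:
        # Re-locate the link on each pass so we never hold a stale reference
        button = wait.until(EC.element_to_be_clickable(
            (By.XPATH, "//a[text()=' Load More.... ']")))
        driver.execute_script("arguments[0].click();", button)
    except TimeoutException:
        break  # the link is gone: all rows are loaded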

The resulting overview_table2 looks like:

     Player Name            Team              Matches  Goals  Time Played
0    Jorge Pereyra Diaz     Mumbai City           9.0    6.0     538 Mins
1    Cleiton Silva          SC East Bengal        8.0    5.0     707 Mins
2    Abdenasser El Khayati  Chennaiyin FC         5.0    4.0     231 Mins
..   ...                    ...                   ...    ...          ...
270  Michael Jakobsen       NorthEast United      8.0    0.0     676 Mins
271  Pratik Chowdhary       Jamshedpur FC         6.0    0.0     495 Mins
272  Chungnunga Lal         SC East Bengal        8.0    0.0     720 Mins

Side note

Job done. As a further improvement you may play with different browsers and try headless mode: a mode where the browser window does not open in your desktop environment, but runs silently in the background.
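
A minimal sketch of headless Firefox, assuming the same geckodriver setup as above (only the driver construction changes, the rest of the code stays the same):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run Firefox without opening a window

driver = webdriver.Firefox(options=options)
# ...same clicking and parsing code as above...
driver.close()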
