This is a dynamically loaded page, so you cannot get all the contents without clicking a button.
Well… maybe you can with XHR or something like that; maybe someone else will contribute that to the answers here.
I'll stick to handling dynamically loaded pages with the Selenium browser automation suite.
Installation
To get started, you'll need to install selenium bindings:
pip install selenium
You seem to already have BeautifulSoup, but for anyone who might come across this answer: we'll also need it and html5lib later to parse the table:
pip install html5lib BeautifulSoup4
Now, for Selenium to work you'll need a driver installed for the browser of your choice. To get the drivers you may use Selenium Manager, driver management software, or download them manually. The first two options are relatively new; I have been using manually downloaded drivers for ages, so I'll stick to them. I'll duplicate the download links here:
You can use any browser, e.g. Brave, Yandex Browser, basically any Chromium-based browser of your choice, or even Tor Browser.
Anyway, that's a bit outside the scope of this answer; just keep in mind that every browser family needs its own driver.
I'll stick with Firefox. Hence you need Firefox installed and the driver placed somewhere. The best option is to add that folder to the PATH variable.
If you choose Chromium, the driver version has to match the Chrome browser version strictly. As for Firefox, I have a pretty old geckodriver 0.29.1 and it works like a charm with the latest update.
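Before moving on, it can help to check that the driver really is reachable through PATH. The snippet below is only a small sketch using the standard library; the helper name find_driver is made up for illustration, and the default name geckodriver assumes you went with Firefox.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
print(path if path else "geckodriver is not on PATH")
```

If this prints a path, webdriver.Firefox() should be able to locate the driver without any extra arguments.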
Hands on
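from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
driver = webdriver.Firefox()
driver.get(URL2)

# Selenium 4 spelling of find_element_by_xpath(...)
element = driver.find_element(By.XPATH, "//a[text()=' Load More.... ']")
# Keep clicking through JavaScript until the link is no longer displayed
while element.is_displayed():
    driver.execute_script("arguments[0].click();", element)

table = driver.find_element(By.CSS_SELECTOR, "table")
# Recent pandas versions warn when given a literal HTML string, hence StringIO
tables2 = pd.read_html(StringIO(table.get_attribute("outerHTML")))
driver.close()

# Drop all-empty rows and columns, then duplicates, and renumber the index
overview_table2 = tables2[0].dropna(how='all').dropna(axis='columns', how='all')
overview_table2 = overview_table2.drop_duplicates().reset_index(drop=True)
overview_table2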
- We only need pandas for our resulting table and selenium for web automation.
- URL2 is the same variable you used.
- driver = webdriver.Firefox(): here we instantiate Firefox and the browser opens. This is where the Selenium magic happens.
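Note: if you decided to skip adding the driver to the PATH variable, you can reference it directly here, e.g.:
webdriver.Firefox(r"C:\WebDriver\bin")
webdriver.Chrome(service=Service(executable_path="/path/to/chromedriver"))
On Selenium 4 the second form is the one to use; Service is imported from selenium.webdriver.chrome.service (selenium.webdriver.firefox.service for Firefox).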
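- driver.get(URL2) opens the desired page.
- element = driver.find_element_by_xpath("//a[text()=' Load More.... ']"): using an XPath selector, we find the link whose text matches your 20th row. Newer Selenium 4 releases no longer ship the find_element_by_* helpers; there the same lookup is written driver.find_element(By.XPATH, ...), with By imported from selenium.webdriver.common.by.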
- With that element stored, we keep clicking it until it disappears.
It would be more sensible and easier to just use element.click(), but that results in an error. More info in another Stack Overflow question.
- Assign the corresponding element to the table variable.
- tables2: I left this odd variable name as it is in your question.
Here we take outerHTML, because innerHTML would return the contents of the <table> tag but not the tag itself.
- We should not forget to .close() our driver, as we don't need it anymore.
- As a result of the HTML parsing there will be a list, just like in the question. Here I drop the unnamed column and the last empty row.
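The click loop described above runs as fast as the interpreter allows and never gives up if the link stays visible. As a hedged variation, not the code used in this answer, it could be wrapped in a helper with a short pause and an upper bound. The name click_until_hidden and the max_clicks and pause parameters are assumptions made up for this sketch.

```python
import time


def click_until_hidden(driver, element, max_clicks=200, pause=0.5):
    """Click `element` via JavaScript until it is no longer displayed.

    Returns the number of clicks performed. `max_clicks` and `pause` are
    illustrative safety knobs, not values taken from the answer.
    """
    clicks = 0
    while element.is_displayed():
        if clicks >= max_clicks:
            raise RuntimeError(f"link still visible after {max_clicks} clicks")
        # Same JavaScript click as in the answer's loop
        driver.execute_script("arguments[0].click();", element)
        clicks += 1
        time.sleep(pause)  # give the page a moment to append the new rows
    return clicks
```

With the driver and element from the main script, this would be called as click_until_hidden(driver, element) in place of the while loop.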
The resulting overview_table2 looks like:
|     | Player Name | Team | Matches | Goals | Time Played |
|-----|-------------|------|---------|-------|-------------|
| 0 | Jorge Pereyra Diaz | Mumbai City | 9.0 | 6.0 | 538 Mins |
| 1 | Cleiton Silva | SC East Bengal | 8.0 | 5.0 | 707 Mins |
| 2 | Abdenasser El Khayati | Chennaiyin FC | 5.0 | 4.0 | 231 Mins |
| ... | ... | ... | ... | ... | ... |
| 270 | Michael Jakobsen | NorthEast United | 8.0 | 0.0 | 676 Mins |
| 271 | Pratik Chowdhary | Jamshedpur FC | 6.0 | 0.0 | 495 Mins |
| 272 | Chungnunga Lal | SC East Bengal | 8.0 | 0.0 | 720 Mins |
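The cleanup chain is easy to try without a browser. The sketch below runs it on a toy frame that only stands in for tables2[0]: the unnamed empty column, the duplicate row and the empty row are assumptions about what the raw table looks like. The last two lines are an optional extra that is not part of the answer's code. Matches most likely shows up as 9.0 because the empty row forced the column to float, so once that row is gone it can be cast back to integers.

```python
import pandas as pd

# Toy stand-in for tables2[0]: an unnamed all-empty column,
# one duplicated row and one completely empty row
raw = pd.DataFrame(
    {
        "Unnamed: 0": [None, None, None, None],
        "Player Name": ["Jorge Pereyra Diaz", "Cleiton Silva", "Cleiton Silva", None],
        "Matches": [9.0, 8.0, 8.0, None],
        "Time Played": ["538 Mins", "707 Mins", "707 Mins", None],
    }
)

clean = (
    raw.dropna(how="all")               # drop rows where every cell is empty
    .dropna(axis="columns", how="all")  # drop columns where every cell is empty
    .drop_duplicates()
    .reset_index(drop=True)
)

# Optional extras: integer match counts and minutes as a plain number
clean["Matches"] = clean["Matches"].astype(int)
clean["Minutes"] = clean["Time Played"].str.replace(" Mins", "", regex=False).astype(int)
print(clean)
```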
Side note
Job done. As a further improvement you may experiment with different browsers and try headless mode, in which the browser does not open in your desktop environment but runs silently in the background.