Don't create a new driver instance for each iteration. Your script spends hardly any time extracting the data; the majority is spent opening the browser and loading the URL over and over again.
Here's what I did with your code -
1) Placed the driver initialization and the driver.quit() outside the loop.
2) Used the Selenium webdriver itself to scrape the data instead of BeautifulSoup, since the latter's results were not consistent or reliable because the data is rendered by JavaScript. (Plus there is no need for an external library; you can get all your data from Selenium itself.)
3) Used JavaScript to open the URLs so that we could wait (using WebDriverWait) only for the relevant elements on the page to appear, instead of waiting for it to load in its entirety.
The final code took less than half the time of your original code to scrape the data. (Measured via this method for 3 iterations.)
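If you want to time the two versions yourself, a simple stand-in (not the exact method linked above) is to wrap the scraping run with time.perf_counter; the helper below and the dummy workload are illustrative:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; replace with your scraping function to compare versions
result, elapsed = timed(lambda: sum(range(1000)))
print(f"took {elapsed:.6f}s")
```

Run each version a few times and average, since browser startup and network latency vary between runs.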
EDIT -
There are some pages, like this, which do not have the required statistics. In that case the line below will throw a TimeoutException -
rows = small_wait.until(EC.presence_of_all_elements_located((By.XPATH,"//div[@id = 'statisticsOverview']//tr")))
So you can simply handle that exception and instead check whether the "No statistics available" element is present or not (using is_displayed()).
Final Code -
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time

dat = []
driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 10)
small_wait = WebDriverWait(driver, 4)  # because performance is a concern

for m in range(0, 10000):
    # Open the URL via JavaScript so we only wait for the elements we need
    driver.execute_script('window.open("http://www.ultimatetennisstatistics.com/playerProfile?playerId=' + str(m) + '","_self")')
    dat.append([wait.until(EC.presence_of_element_located((By.XPATH, '/html/body/h3'))).text])
    dat.append(m)
    try:
        dropdown = driver.find_element_by_xpath('//*[@id="playerPills"]/li[9]/a')
        dropdown.click()
        bm = driver.find_element_by_id('statisticsPill')
        bm.click()
        try:
            rows = small_wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id = 'statisticsOverview']//tr")))
            for i in rows:
                dat.append([i.text])
        except TimeoutException:
            # No statistics table; look for the "No statistics available" message instead
            no_statistics_element = small_wait.until(EC.presence_of_element_located((By.XPATH, "//div[@id='playerStatsTab']/p[contains(text(),'No statistics available')]")))
            if no_statistics_element.is_displayed():
                dat.append([no_statistics_element.text])
            continue
    except NoSuchElementException:
        # find_element raises NoSuchElementException (not ValueError) when the element is missing
        print("error")
        dat.append('????')

driver.quit()