
I have the code below that extracts information from a website using Selenium. The code works fine, but it is pretty slow. I was wondering if there is anything I could change to make the program go faster.

from selenium import webdriver
from bs4 import BeautifulSoup

dat = []

for m in range(1, 10000):
    driver = webdriver.Chrome()
    driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=" + str(m))
    dat.append([driver.find_element_by_xpath('/html/body/h3').text])
    dat.append(m)
    try:
        dropdown = driver.find_element_by_xpath('//*[@id="playerPills"]/li[9]/a')
        dropdown.click()
        bm = driver.find_element_by_id('statisticsPill')
        bm.click()
        driver.maximize_window()
        soup = BeautifulSoup(driver.page_source, "lxml")
        for i in soup.select('#statisticsOverview table tr'):
            dat.append([x.get_text(strip=True) for x in i.select("th,td")])
        driver.quit()
    except ValueError:
        print("error")
    dat.append('????')
  • Reuse the driver. You're basically using 10000 different ones, and that takes a lot of time. So instantiate the driver before your for-loop, and move the quit command down to the bottom. You could also consider reverse engineering the site and seeing if it's possible to extract the data without Selenium. Use Chrome and inspect what happens in the network tab once you click the bm element. Perhaps you can fetch the data from the same endpoint directly, like the site does. That's most often possible and much faster than browser automation... – jlaur Aug 11 '18 at 11:55
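Following up on that comment, a minimal sketch of what fetching the page without a browser could look like. This is an assumption, not a confirmed solution: the actual endpoint and response format would have to be discovered in the network tab first, and the parsing adjusted to whatever the real response contains.

```python
import requests
from bs4 import BeautifulSoup

BASE = "http://www.ultimatetennisstatistics.com/playerProfile"

def profile_url(player_id):
    """Build the profile URL for a given player id."""
    return BASE + "?playerId=" + str(player_id)

def fetch_profile(player_id):
    """Fetch a profile page directly and parse it with BeautifulSoup.

    Assumes the page (or the endpoint found in the network tab) is
    served as plain HTML; if the site returns JSON instead, parse
    resp.json() rather than building a soup.
    """
    resp = requests.get(profile_url(player_id), timeout=10)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "lxml")

# Usage (requires network access):
#   soup = fetch_profile(1)
#   print(soup.h3.get_text(strip=True) if soup.h3 else "no <h3> found")
```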

1 Answer


Don't create a new driver instance for each iteration. There is hardly any time taken by your script to extract the data. The majority of it is spent only on opening the browser and loading the URL again and again.

Here's what I did with your code -

1) Placed the driver initialization and the driver.quit() outside the loop.

2) Used the Selenium webdriver itself to scrape the data instead of BeautifulSoup, as the latter's results were not consistent and reliable since the data comes from JavaScript. (Plus there is no need for an external library; you can get all your data from Selenium itself.)

3) Used JavaScript to open the URLs so that we can wait only for the relevant elements on the page to appear (using WebDriverWait), instead of waiting for the page to load in its entirety.

The final code took less than half the time of your original code to scrape the data. (Measured via this method for 3 iterations.)
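If you want to reproduce such a comparison yourself, one simple way is a small timing helper around the scraping function. This is a generic sketch (the stand-in workload below is just a placeholder for the real scraper):

```python
import time

def timed(fn, iterations=3):
    """Run fn `iterations` times and return the average wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations

# Stand-in workload instead of the real scraper:
avg = timed(lambda: sum(range(100000)), iterations=3)
print("average: %.4fs" % avg)
```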

EDIT -

There are some pages, like this one, which do not have the required statistics. In that case the line below will throw a TimeoutException -

rows = small_wait.until(EC.presence_of_all_elements_located((By.XPATH,"//div[@id = 'statisticsOverview']//tr")))

So you can simply handle that exception and instead check whether the "No statistics available" element is present or not (using is_displayed()).

Final Code -

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException

dat = []
driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 10)
small_wait = WebDriverWait(driver, 4)    # because performance is a concern

for m in range(0, 10000):
    driver.execute_script('window.open("http://www.ultimatetennisstatistics.com/playerProfile?playerId=' + str(m) + '","_self")')
    dat.append([wait.until(EC.presence_of_element_located((By.XPATH, '/html/body/h3'))).text])
    dat.append(m)
    try:
        dropdown = driver.find_element_by_xpath('//*[@id="playerPills"]/li[9]/a')
        dropdown.click()
        bm = driver.find_element_by_id('statisticsPill')
        bm.click()
        try:
            rows = small_wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id = 'statisticsOverview']//tr")))
            for i in rows:
                dat.append([i.text])
        except TimeoutException:
            no_statistics_element = small_wait.until(EC.presence_of_element_located((By.XPATH, "//div[@id='playerStatsTab']/p[contains(text(),'No statistics available')]")))
            if no_statistics_element.is_displayed():
                dat.append([no_statistics_element.text])
                continue
    except NoSuchElementException:
        print("error")
    dat.append('????')

driver.quit()
Saad
Shivam Mishra
  • Hey, thanks for your help. I tweaked your code a little bit and now it works pretty fast. – smith Aug 11 '18 at 18:41
  • Hey, it actually gives me an error, I think because some webpages don't have any stats, like http://www.ultimatetennisstatistics.com/playerProfile?playerId=45103. Can you help me tweak the code for pages that don't have stats? – smith Aug 12 '18 at 03:36
  • I get a TimeoutException. – smith Aug 12 '18 at 03:40