2

I am building a web crawler/scraper using Python and Scrapy. Because some websites load their content dynamically, I'm also using Selenium in combination with PhantomJS. When I started using this I thought the performance would be acceptable, but it turns out to be quite slow. I'm not sure whether that is due to some flaw in my code or because the frameworks/programs I'm using are not optimised enough, so I'm asking you for suggestions on what I could do to improve the performance.
The code I wrote takes approx. 35 seconds from start to finish. It executes about 11 GET requests and 3 POST requests.

import scrapy
from scrapy.http.request import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
import time


class TechcrunchSpider(scrapy.Spider):
    name = "techcrunch_spider_performance"
    allowed_domains = ['techcrunch.com']
    start_urls = ['https://techcrunch.com/search/heartbleed']



    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)
        #self.driver = webdriver.Chrome("C:\Users\Daniel\Desktop\Sonstiges\chromedriver.exe")
        self.driver.wait = WebDriverWait(self.driver, 5)    # waits up to 5 seconds

    def parse(self, response):
        start = time.time()     # start timing
        self.driver.get(response.url)

        # waits up to 5 seconds (as defined above) for the condition to be met,
        # then raises a TimeoutException
        try:

            self.driver.wait.until(EC.presence_of_element_located(
                (By.CLASS_NAME, "block-content")))
            print("Found : block-content")

        except TimeoutException:
            self.driver.close()
            print(" block-content NOT FOUND IN TECHCRUNCH !!!")
            return  # stop here, the driver has already been closed


        # Crawl the JavaScript-generated content with Selenium
        ahref = self.driver.find_elements(By.XPATH, '//h2[@class="post-title st-result-title"]/a')

        hreflist = []
        # Collect the links to all the individual articles
        for elem in ahref:
            hreflist.append(elem.get_attribute("href"))


        for elem in hreflist:
            print(elem)



        print("im closing myself")
        self.driver.close()
        end = time.time()
        print("Time elapsed : ")
        finaltime = end-start
        print(finaltime)

I am using Windows 8 64-bit, an Intel i7-3630QM CPU @ 2.4 GHz, an Nvidia GeForce GT 650M and 8 GB RAM.

Chanda Korat
BlackBat
  • You could try generating the AJAX requests through your spider, thus eliminating the need for Selenium and the need to wait 5 seconds for the page to be loaded (a rough sketch of this idea follows below). Check this [frequent post](https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax). – rongon Jun 13 '17 at 08:14
  • Read the answers to this question: https://stackoverflow.com/questions/39036137/how-yo-make-a-selenium-scripts-faster – parik Jun 13 '17 at 08:41
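
A rough sketch of that first suggestion: fetch the search page with Scrapy alone and extract the links with the same XPath as in the spider above, skipping Selenium and the explicit wait entirely. Whether the article links are actually present in the plain HTML response (rather than being injected by JavaScript, in which case you would have to find and request the underlying AJAX endpoint instead) is an assumption you need to verify for the target page.

import scrapy


class TechcrunchAjaxSpider(scrapy.Spider):
    name = "techcrunch_spider_ajax"
    allowed_domains = ['techcrunch.com']
    start_urls = ['https://techcrunch.com/search/heartbleed']

    def parse(self, response):
        # Extract the article links straight from the HTTP response;
        # no browser and no 5-second wait involved.
        hreflist = response.xpath(
            '//h2[@class="post-title st-result-title"]/a/@href').extract()
        for href in hreflist:
            print(href)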

2 Answers

2

I was also facing this same issue, getting only 2 URLs processed per minute.

I cache the web pages by doing this:

......
options = ['--disk-cache=true']
self.driver = webdriver.PhantomJS(service_args=options)
......

This shot the URL processing rate up from 2 to 11 per minute in my case. This may vary from web page to web page.

In case you want to disable image loading to speed up page loading in Selenium, add --load-images=false to the options above.
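
For example, both flags can be combined in the same service_args list (a small sketch based on the snippet above):

# cache pages on disk and skip image downloads
options = ['--disk-cache=true', '--load-images=false']
self.driver = webdriver.PhantomJS(service_args=options)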

Hope it helps.

Om Prakash
1

Try using Splash to process pages with JavaScript instead.
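
A minimal sketch of how that could look with the scrapy-splash plugin, assuming you have a Splash instance running (e.g. via Docker on the default port 8050) and have added SPLASH_URL plus the scrapy-splash middlewares to your project settings:

import scrapy
from scrapy_splash import SplashRequest


class TechcrunchSplashSpider(scrapy.Spider):
    name = "techcrunch_spider_splash"
    allowed_domains = ['techcrunch.com']

    def start_requests(self):
        # Let Splash render the page and give its JavaScript a moment to run.
        yield SplashRequest('https://techcrunch.com/search/heartbleed',
                            self.parse, args={'wait': 2})

    def parse(self, response):
        for href in response.xpath(
                '//h2[@class="post-title st-result-title"]/a/@href').extract():
            print(href)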

graph