
I'm trying to scrape a review website (similar to Trustpilot). First, I collected a list of ~50k URLs of complaints to scrape. Then I scrape specific data from each complaint URL.

The problem is that my for loop is getting progressively slower. It started out scraping a URL every 3 seconds, but the rate has now dropped to about 20 s/iteration.

Could someone review my code and point out potential flaws?

Thanks.

import pandas as pd
from tqdm import tqdm

df = pd.DataFrame()
count = 0
for url in tqdm(urls):
    driver.get(url)
    count += 1
    try:
        df_load = pd.DataFrame({'id' : [count],
         'caption' : [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/h1').text],
         'details': [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/ul[1]').text],
         'status' : [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[3]/span[2]/strong').text],
         'complaint' : [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[2]/p').text]})
        # prepend this row's one-line DataFrame to the accumulated one
        df = pd.concat([df_load, df])
    except Exception:
        print(f'ID {count} did not work')
tacd
    Do you need to render JavaScript on the page you are scraping, or are you just parsing the HTML? – Yannick Funk Sep 14 '20 at 13:44
  • I tried parsing the HTML directly with Scrapy but wasn't successful (the website responded that I was running an outdated browser, despite my randomly assigning user agents). – tacd Sep 14 '20 at 13:55
  • Ok, so Selenium might be the right choice. Which webdriver are you using? – Yannick Funk Sep 14 '20 at 13:56
  • Chrome, v85 (same version as my browser) – tacd Sep 14 '20 at 13:58
  • Try using PhantomJS (it avoids the overhead of rendering the page on screen); if it works, I can post this as an answer. You can find a good introduction here: https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/ – Yannick Funk Sep 14 '20 at 13:59
  • Got this warning/error: warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless) – tacd Sep 14 '20 at 14:35 (a headless-Chrome sketch follows these comments)
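
For reference, the headless mode the deprecation warning recommends can be enabled in Chrome itself, avoiding PhantomJS entirely. A minimal sketch (it assumes chromedriver is on the PATH, and the options= keyword needs Selenium 3.8+):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')     # render pages without opening a window
options.add_argument('--disable-gpu')  # recommended on some platforms for headless mode
driver = webdriver.Chrome(options=options)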

1 Answer


Using concat on a DataFrame inside a loop is very slow and heavy on memory: each call returns a new copy of the whole DataFrame, so the cost compounds with every iteration, which explains the gradual slowdown. See the replies here for greater detail:

Why does concatenation of DataFrames get exponentially slower?
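
The fix, spelled out in the comments below, is to append one dict per page to a plain list and build the DataFrame once after the loop. A minimal sketch, assuming the same urls and driver as in the question:

from selenium.common.exceptions import NoSuchElementException
from tqdm import tqdm
import pandas as pd

# XPath strings copied from the question
XPATHS = {
    'caption':   '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/h1',
    'details':   '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/ul[1]',
    'status':    '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[3]/span[2]/strong',
    'complaint': '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[2]/p',
}

rows = []                                # plain list: appending is O(1)
for count, url in enumerate(tqdm(urls)):
    driver.get(url)
    try:
        row = {'id': count}
        for field, xpath in XPATHS.items():
            row[field] = driver.find_element_by_xpath(xpath).text
        rows.append(row)
    except NoSuchElementException:
        print(f'ID {count} did not work')

df = pd.DataFrame(rows)                  # build the DataFrame once, after the loop

With this pattern each iteration does a constant amount of work, so the per-URL time should stay flat instead of growing with the size of df.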

Dean
  • Thanks. Should I append the scraped data to a list inside the loop? – tacd Sep 14 '20 at 14:37
  • Exactly! Append to a list on each iteration inside the loop, and once the loop is done you can concatenate the list. – Dean Sep 14 '20 at 14:42
  • Thanks, Dean. Any ideas how to speed this code up further? I'm trying to scrape 50k+ URLs lol – tacd Sep 14 '20 at 14:52
  • Do you need to use Selenium to grab the specific data? If not, I'd consider using Requests and BeautifulSoup instead. – Dean Sep 14 '20 at 14:55 (a rough Requests + BeautifulSoup sketch follows this thread)
  • I tried parsing the HTML directly with Scrapy but wasn't successful (the website responded that I was running an outdated browser, despite my randomly assigning user agents). – tacd Sep 14 '20 at 14:56
  • Dean, I'm not sure this worked... iterations are still taking 20 s with the list method – tacd Sep 14 '20 at 16:15
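
For reference, a rough sketch of the Requests + BeautifulSoup route Dean suggests. The tag selectors below are placeholders, since the page's actual markup isn't shown (only the complain-detail id appears in the question's XPaths), and the spoofed Chrome User-Agent header is an assumption about what the site's "outdated browser" check looks for:

import requests
from bs4 import BeautifulSoup

# assumption: a realistic desktop-Chrome UA string to pass the browser check
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/85.0.4183.83 Safari/537.36'}

def scrape(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    detail = soup.find(id='complain-detail')  # same id the question's XPaths target
    if detail is None:
        return None
    # placeholder selectors -- adjust to the page's actual markup
    return {
        'caption':   detail.find('h1').get_text(strip=True),
        'complaint': detail.find('p').get_text(strip=True),
    }

If the site still rejects plain HTTP requests, the page content likely depends on JavaScript rendering, which would explain why Scrapy failed while Selenium works.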