
I'm trying to scrape a review website (similar to Trustpilot). First, I collected a list of ~50k URLs of complaints to scrape. Then I scrape specific data from each complaint URL.

The problem is that my for loop is getting progressively slower. It started out scraping a URL every 3 seconds, but the rate has now dropped to about 20 s/iteration.

Could someone review my code and point out potential flaws?

Thanks.

import pandas as pd
from tqdm import tqdm

df = pd.DataFrame()
count = 0
for url in tqdm(urls):
    driver.get(url)
    count += 1
    try:
        df_load = pd.DataFrame({'id' : [count],
         'caption' : [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/h1').text],
         'details': [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/ul[1]').text],
         'status' : [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[3]/span[2]/strong').text],
         'complaint' : [driver.find_element_by_xpath(
            '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[2]/p').text]})
        # prepend this row's one-line DataFrame to the accumulated one
        df = pd.concat([df_load, df])
    except Exception:
        print(f'ID {count} did not work')
tacd
    Do you need to render JavaScript on the page you are scraping, or are you just parsing the HTML? – Yannick Funk Sep 14 '20 at 13:44
  • I tried parsing the HTML directly with Scrapy but wasn't successful (the website responded that I was running an outdated browser, despite my randomly assigning user agents). – tacd Sep 14 '20 at 13:55
  • Ok, so Selenium might be the right choice. Which webdriver are you using? – Yannick Funk Sep 14 '20 at 13:56
  • Chrome, v85 (same version as my browser) – tacd Sep 14 '20 at 13:58
  • Try using PhantomJS (it avoids the overhead of rendering the page on screen); if it works, I can post this as an answer. You can find a good introduction here: https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/ – Yannick Funk Sep 14 '20 at 13:59
  • Got this warning/error: warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless) – tacd Sep 14 '20 at 14:35 (a headless-Chrome sketch follows these comments)
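
For reference, the headless mode the deprecation warning recommends can be enabled in Chrome itself, avoiding PhantomJS entirely. A minimal sketch (it assumes chromedriver is on the PATH, and the options= keyword needs Selenium 3.8+):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')     # render pages without opening a window
options.add_argument('--disable-gpu')  # recommended on some platforms for headless mode
driver = webdriver.Chrome(options=options)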

1 Answer


Using concat on a DataFrame inside a loop is very slow and heavy on memory: each call returns a new copy of the whole DataFrame, so the cost compounds with every iteration, which explains the gradual slowdown. See the replies here for greater detail:

Why does concatenation of DataFrames get exponentially slower?
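
The fix, spelled out in the comments below, is to append one dict per page to a plain list and build the DataFrame once after the loop. A minimal sketch, assuming the same urls and driver as in the question:

from selenium.common.exceptions import NoSuchElementException
from tqdm import tqdm
import pandas as pd

# XPath strings copied from the question
XPATHS = {
    'caption':   '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/h1',
    'details':   '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[1]/ul[1]',
    'status':    '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[1]/div[2]/div[3]/span[2]/strong',
    'complaint': '//*[@id="complain-detail"]/div/div[1]/div[2]/div/div[2]/p',
}

rows = []                                # plain list: appending is O(1)
for count, url in enumerate(tqdm(urls)):
    driver.get(url)
    try:
        row = {'id': count}
        for field, xpath in XPATHS.items():
            row[field] = driver.find_element_by_xpath(xpath).text
        rows.append(row)
    except NoSuchElementException:
        print(f'ID {count} did not work')

df = pd.DataFrame(rows)                  # build the DataFrame once, after the loop

With this pattern each iteration does a constant amount of work, so the per-URL time should stay flat instead of growing with the size of df.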

Dean
  • Thanks. Should I append the scraped data to a list inside the loop? – tacd Sep 14 '20 at 14:37
  • Exactly! Append to a list on each iteration inside the loop, and once the loop is done you can concatenate the list. – Dean Sep 14 '20 at 14:42
  • Thanks, Dean. Any ideas how to speed this code up further? I'm trying to scrape 50k+ URLs lol – tacd Sep 14 '20 at 14:52
  • Do you need to use Selenium to grab the specific data? If not, I'd consider using Requests and BeautifulSoup instead. – Dean Sep 14 '20 at 14:55 (a rough Requests + BeautifulSoup sketch follows this thread)
  • I tried parsing the HTML directly with Scrapy but wasn't successful (the website responded that I was running an outdated browser, despite my randomly assigning user agents). – tacd Sep 14 '20 at 14:56
  • Dean, I'm not sure this worked... iterations are still taking 20 s with the list method – tacd Sep 14 '20 at 16:15
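
For reference, a rough sketch of the Requests + BeautifulSoup route Dean suggests. The tag selectors below are placeholders, since the page's actual markup isn't shown (only the complain-detail id appears in the question's XPaths), and the spoofed Chrome User-Agent header is an assumption about what the site's "outdated browser" check looks for:

import requests
from bs4 import BeautifulSoup

# assumption: a realistic desktop-Chrome UA string to pass the browser check
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/85.0.4183.83 Safari/537.36'}

def scrape(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    detail = soup.find(id='complain-detail')  # same id the question's XPaths target
    if detail is None:
        return None
    # placeholder selectors -- adjust to the page's actual markup
    return {
        'caption':   detail.find('h1').get_text(strip=True),
        'complaint': detail.find('p').get_text(strip=True),
    }

If the site still rejects plain HTTP requests, the page content likely depends on JavaScript rendering, which would explain why Scrapy failed while Selenium works.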