
I have been doing web scraping of review websites, but the time spent running the code is exhausting: it takes at least a day for roughly 20k reviews.

To solve this problem I've been looking for alternatives, and the most common suggestion is to parallelize the tasks.

So I tried to use ThreadPool from multiprocessing.dummy, but I always get the same error and I don't know how to solve it.

The error I always get is: Message: stale element reference: element is not attached to the page document (Session info: chrome=96.0.4664.110)

The code I developed is below (all the extracted info is stored in lists, though I've considered using dicts... any suggestions would be appreciated):

Step 1: Start from an input with the search term for the items whose reviews you want to obtain.

Step 2: Obtain all the items (URLs) related to that input through pagination (clicking buttons when they expose more items), together with the quick information shown on each item's cover.

Step 3: For each item URL, load the page with driver.get(url) and extract detailed info about the product (current and previous price, average rating, color, shape, description... as much info as possible); all of this is stored in another list with the details of every item.

Step 4: Staying on the same URL, use try/except to check whether "view all reviews" is available and click the corresponding button, then paginate through all the URLs containing reviews for that specific item.
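
The helper pagination_data_urls() used later isn't shown in the question; purely for illustration, a rough sketch of the pagination described in Steps 2 and 4 could look like the following (the "next page" selector and the reliance on the module-level driver are assumptions, not the author's actual code):

    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    import time

    def pagination_data_urls():
        # Walk the "next page" link and remember each review-page URL.
        # Uses the global `driver`, as in the question's own code.
        urls = []
        while True:
            urls.append(driver.current_url)
            try:
                # assumed selector for the "next page" control
                next_button = driver.find_element(By.XPATH, "//li[@class='a-last']/a")
            except NoSuchElementException:
                break  # no further pages
            next_button.click()
            time.sleep(2)  # crude wait for the next page to load
        return urls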

All of these steps complete successfully, and none of them takes as long as extracting the reviews.

For example, for an item with 2K reviews the previous steps take around 10 minutes, but extracting the information from the reviews takes about 4 hours.

I have to admit that for each review I look up the elements (title, rating, verified, content, useful votes, review...) through the user's corresponding id, so this might be one reason it takes so long.

Anyway, I tried to solve this problem with ThreadPool.

The function extract_info_per_review(url) is the following:

    def extract_info_per_review(url):
        # header_reviews = ['Username','Rating','Title','Date','Size','Verified','Review','Images','Votes']

        driver.get(url)

        all_reviews_page = list()
        all_reviews = driver.find_elements(By.XPATH, "//div[@class='a-section review aok-relative']")

        for review in all_reviews:
            ### Option 1:
            user_id = review.get_attribute('id')
            try_data_review = list()

            ### Option 2:
            # try:
            #     user_id = review.get_attribute('id')
            #     try_data_review = list()
            # except:
            #     print('Not id found')
            #     pass

            ### Option 3:
            # try_data_review = list()
            # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
            # user_id = WebDriverWait(driver, 60, ignored_exceptions=ignored_exceptions).until(
            #     expected_conditions.presence_of_element_located(review.get_attribute('id')))

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-profile-name']".format(user_id)).text)
            except:
                try_data_review.append('Not username')

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//i[@data-hook='review-star-rating']//span[@class='a-icon-alt']".format(user_id)).get_attribute('innerHTML'))
            except:
                try_data_review.append('Not rating')

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='cr-original-review-content']".format(user_id)).text)
            except:
                try_data_review.append('Not title')

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@data-hook='review-date']".format(user_id)).text)
            except:
                try_data_review.append('Not date')

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//a[@data-hook='format-strip']".format(user_id)).text)
            except:
                try_data_review.append('Not size')

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@data-action='reviews:filter-action:push-state']".format(user_id)).text)
            except:
                try_data_review.append('Not verified')

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-size-base review-text review-text-content']".format(user_id)).text)
            except:
                try_data_review.append('Not review')

            try:
                try_images = list()
                images = driver.find_elements(By.XPATH, "//div[@id='{}']//div[@class='review-image-tile-section']//img[@alt='Imagen del cliente']".format(user_id))
                for image in images:
                    try_images.append(image.get_attribute('src'))
                try_data_review.append(try_images)
            except:
                try_data_review.append('Not image')

            try:
                try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-size-base a-color-tertiary cr-vote-text']".format(user_id)).text)
            except:
                try_data_review.append('Not votes')

            all_reviews_page.append(try_data_review)

        print('Review extraction done')

        return all_reviews_page

And the ThreadPool implementation is:

    data_items_reviews = list()

    try:
        reviews_view_sort()
        reviews_urls = list()
        urls_review_from_item = pagination_data_urls()
        time.sleep(2)

        ### V1: Too much time spent
        # for url in urls_review_from_item:
        #     reviews_urls.append(extract_info_per_review(url))
        #     time.sleep(2)

        # data_items_reviews.append(reviews_urls)

        ### V2: Try ThreadPool
        pool = ThreadPool(3)  # 1,3,5,7
        results = pool.map(extract_info_per_review, urls_review_from_item)
        data_items_reviews.append(results)

    except:
        print('Item without reviews')
        data_items_reviews.append('Item without reviews')
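
One small side note on the ThreadPool usage itself, separate from the stale-element error: map() blocks until every URL has been processed, but the pool is never closed afterwards. A minimal pattern with an explicit close/join would be:

    pool = ThreadPool(3)
    try:
        results = pool.map(extract_info_per_review, urls_review_from_item)
    finally:
        pool.close()  # stop accepting new work
        pool.join()   # wait for the worker threads to finish
    data_items_reviews.append(results)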

All the imports:

    import random
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions
    from multiprocessing.dummy import Pool as ThreadPool
    import time
    from webdriver_manager.chrome import ChromeDriverManager


    from selectorlib import Extractor
    import os
    from datetime import date
    import shutil
    import json
    import pandas as pd
    from datetime import datetime
    import csv

The last imports come from another task related to storing the information more efficiently; I should ask that as a separate question.

I'm stuck, so any recommendations are welcome. Thanks to all of you, I'll stay in touch.

1 Answer


'Message: stale element reference: element is not attached to the page document (Session info: chrome=96.0.4664.110)'

I believe you need to create and manage a separate Selenium driver instance for each thread or process. If thread 1 loads a page and then thread 2 loads a page, all the state attached to the driver from thread 1 is invalidated (including elements, URL, etc.). Each thread needs to create its own driver, which will be a separate browser instance. You should be able to run quite a few browsers at the same time, each processing one URL at a time.
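
For illustration, a minimal sketch of that idea, where each task creates and quits its own browser (the wrapper name is made up, and extract_info_per_review is assumed to be modified to take the driver as a parameter instead of using a global):

    from multiprocessing.dummy import Pool as ThreadPool
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    def scrape_reviews_page(url):
        # One fresh browser per task, so no Selenium state is shared between threads.
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        try:
            return extract_info_per_review(driver, url)  # assumes the function now accepts a driver
        finally:
            driver.quit()

    pool = ThreadPool(3)
    results = pool.map(scrape_reviews_page, urls_review_from_item)
    pool.close()
    pool.join()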

You could implement the multiprocessing via ThreadPool or ThreadPoolExecutor as you've started. I've had luck in the past using multiprocessing.Process along with multiprocessing.Queue (your queue could be a list of URLs, and each process parses one URL at a time). Regardless, each thread or process needs to maintain its own Selenium driver instance.
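
A rough sketch of that Process/Queue variant, again with one driver per worker that is reused for every URL the worker pulls from the queue (the worker function and the driver-taking extract_info_per_review are illustrative, not from the question):

    from multiprocessing import Process, Queue
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    def review_worker(url_queue, result_queue):
        # Each process owns exactly one browser and reuses it for every URL it pulls.
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        try:
            while True:
                url = url_queue.get()
                if url is None:  # sentinel: no more work
                    break
                result_queue.put(extract_info_per_review(driver, url))
        finally:
            driver.quit()

    if __name__ == '__main__':
        url_queue, result_queue = Queue(), Queue()
        for url in urls_review_from_item:
            url_queue.put(url)

        n_workers = 3
        workers = [Process(target=review_worker, args=(url_queue, result_queue))
                   for _ in range(n_workers)]
        for _ in workers:
            url_queue.put(None)  # one sentinel per worker
        for w in workers:
            w.start()
        results = [result_queue.get() for _ in urls_review_from_item]
        for w in workers:
            w.join()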

A simple implementation could even skip any multiprocessing in Python by putting the whole job in a single scrape_one_url.py script (creating its own browser/driver and performing the work), and having a separate script execute scrape_one_url.py at the system level. You could get multiprocessing just by starting 5-10 scrape_one_url.py scripts at the same time on different URLs.
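
Sticking with Python for the launcher, batching a few subprocesses at a time could look like this (scrape_one_url.py is the hypothetical per-URL script described above):

    import subprocess

    # Run the hypothetical scrape_one_url.py for each URL, at most 5 browsers at a time.
    batch_size = 5
    for i in range(0, len(urls_review_from_item), batch_size):
        procs = [subprocess.Popen(['python', 'scrape_one_url.py', url])
                 for url in urls_review_from_item[i:i + batch_size]]
        for p in procs:
            p.wait()  # finish the whole batch before starting the next one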

  • Side note: You may be able to parse multiple pages at the same time by creating different browser tabs through Selenium. In this way, you may be able to get away with only one browser instance. This would be a different way to do it but I bet would be more difficult. – mathewguest Jan 18 '22 at 11:17
  • Note: You could also create a Selenium browser instance only once per thread or process and reuse it for subsequent URLs. Just one at a time per thread. – mathewguest Jan 18 '22 at 11:19
  • Thank you very much for your answer, I had read about this issue but I didn't think it would fail because on a small scale the ThreadPool worked. I will study the information in the links you have given me and try to implement it. Anyway, we'll be in touch, thanks. – sergio maestu aragon Jan 19 '22 at 10:21
  • Threading will take you quite far just fine. – mathewguest Jan 20 '22 at 11:09