I've been doing web scraping of review websites, but the time spent running the code is exhausting: it takes at least a day for roughly 20k reviews.
To solve this problem I've been looking for alternatives, and the most common one seems to be parallelizing tasks.
So I tried to implement a ThreadPool from multiprocessing.dummy, but I always hit the same error and I don't know how to solve it. The error is:
Message: stale element reference: element is not attached to the page document (Session info: chrome=96.0.4664.110)
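For context, as far as I understand, multiprocessing.dummy.Pool has the same API as multiprocessing.Pool but runs its workers as threads inside the same process, so all of them share the module-level globals (including, I assume, my single driver). A minimal sketch, with no Selenium involved, of what I understand the pool does:

from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool
import threading

def which_thread(url):
    # Each call runs in one of the pool's worker threads; all workers live
    # in the same process, so module-level globals are shared between them.
    return (url, threading.current_thread().name)

pool = ThreadPool(3)
for url, name in pool.map(which_thread, ['url-1', 'url-2', 'url-3', 'url-4']):
    print(url, 'handled by', name)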
The approach and code I developed are the following (all the extracted info is stored in lists, though I've considered using dicts, see the sketch at the end; any suggestions would be appreciated):
Step 1: Start from an input with the search term whose reviews you want to obtain.
Step 2: Obtain all the items (urls) related to the input through pagination (clicking buttons where they enable extraction of more items), along with the quick information shown on the item cover.
Step 3: For each item url, access the content with driver.get(url), then extract detailed info about the product (price, current and previous; average rating; color; shape; description; as much info as possible); all of this is stored in another list holding the details of every item.
Step 4: On the same url, try/except whether "view all reviews" is accessible and click the corresponding button, then again paginate through all the urls containing reviews of that specific item; a sketch of this pattern follows.
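For step 4 the pattern looks roughly like this (the locator and url are placeholders, not my exact ones):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()  # in my script this is the shared global driver
driver.get('https://www.example.com/item')  # placeholder item url
try:
    # placeholder locator: my real XPath targets the "view all reviews" link
    driver.find_element(By.XPATH, "//a[contains(., 'view all reviews')]").click()
except NoSuchElementException:
    print('No "view all reviews" link for this item')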
All of these steps complete successfully, and none of them takes as long as the review extraction. For example, for an item with 2K reviews, the previous steps take around 10 minutes, but extracting the information from the reviews takes about 4 hours.
I have to admit that for each review I search for its elements (title, rating, verified, content, useful votes, review...) through the id corresponding to the user, so this might be one cause of the excessive time.
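One alternative I've considered (not sure whether it would actually help with speed) is searching inside each already-located review element with relative XPaths, instead of re-querying the whole document by id, something like:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('https://www.example.com/reviews')  # placeholder reviews url
for review in driver.find_elements(By.XPATH, "//div[@class='a-section review aok-relative']"):
    # ".//" scopes the query to this review block, so there is no need to
    # re-locate it in the full document by its id.
    try:
        username = review.find_element(By.XPATH, ".//span[@class='a-profile-name']").text
    except NoSuchElementException:
        username = 'Not username'
    print(username)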
Anyway, I tried to solve this problem with a ThreadPool.
The function extract_info_per_review(url) is the following:
def extract_info_per_review(url):
    # header_reviews = ['Username','Rating','Title','Date','Size','Verified','Review','Images','Votes']
    driver.get(url)
    all_reviews_page = list()
    all_reviews = driver.find_elements(By.XPATH, "//div[@class='a-section review aok-relative']")
    for review in all_reviews:
        ### Option 1:
        user_id = review.get_attribute('id')
        try_data_review = list()
        ### Option 2:
        # try:
        #     user_id = review.get_attribute('id')
        #     try_data_review = list()
        # except:
        #     print('Not id found')
        #     pass
        ### Option 3:
        # try_data_review = list()
        # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
        # user_id = WebDriverWait(driver, 60, ignored_exceptions=ignored_exceptions).until(
        #     expected_conditions.presence_of_element_located(review.get_attribute('id')))
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-profile-name']".format(user_id)).text)
        except:
            try_data_review.append('Not username')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//i[@data-hook='review-star-rating']//span[@class='a-icon-alt']".format(user_id)).get_attribute('innerHTML'))
        except:
            try_data_review.append('Not rating')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='cr-original-review-content']".format(user_id)).text)
        except:
            try_data_review.append('Not title')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@data-hook='review-date']".format(user_id)).text)
        except:
            try_data_review.append('Not date')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//a[@data-hook='format-strip']".format(user_id)).text)
        except:
            try_data_review.append('Not size')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@data-action='reviews:filter-action:push-state']".format(user_id)).text)
        except:
            try_data_review.append('Not verified')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-size-base review-text review-text-content']".format(user_id)).text)
        except:
            try_data_review.append('Not review')
        try:
            try_images = list()
            images = driver.find_elements(By.XPATH, "//div[@id='{}']//div[@class='review-image-tile-section']//img[@alt='Imagen del cliente']".format(user_id))
            for image in images:
                try_images.append(image.get_attribute('src'))
            try_data_review.append(try_images)
        except:
            try_data_review.append('Not image')
        try:
            try_data_review.append(driver.find_element(By.XPATH, "//div[@id='{}']//span[@class='a-size-base a-color-tertiary cr-vote-text']".format(user_id)).text)
        except:
            try_data_review.append('Not votes')
        all_reviews_page.append(try_data_review)
    print('Review extraction done')
    return all_reviews_page
And the ThreadPool implementation is:
data_items_reviews = list()
try:
    reviews_view_sort()
    reviews_urls = list()
    urls_review_from_item = pagination_data_urls()
    time.sleep(2)
    ### V1: Too much time spent
    # for url in urls_review_from_item:
    #     reviews_urls.append(extract_info_per_review(url))
    #     time.sleep(2)
    # data_items_reviews.append(reviews_urls)
    ### V2: Try ThreadPool
    pool = ThreadPool(3)  # 1,3,5,7
    results = pool.map(extract_info_per_review, urls_review_from_item)
    data_items_reviews.append(results)
except:
    print('Item without reviews')
    data_items_reviews.append('Item without reviews')
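From what I've read while debugging, a WebDriver session is not thread-safe, so I suspect the three workers navigating the single shared driver invalidate each other's elements. One idea I've found but not tested yet (get_driver and the thread_local holder are my own names, not part of Selenium) is to give each worker thread its own driver:

import threading
from multiprocessing.dummy import Pool as ThreadPool
from selenium import webdriver

thread_local = threading.local()

def get_driver():
    # Lazily create one Chrome instance per worker thread, so threads never
    # share a browser session.
    if not hasattr(thread_local, 'driver'):
        thread_local.driver = webdriver.Chrome()
    return thread_local.driver

def extract_info_per_review(url):
    driver = get_driver()  # per-thread driver instead of the shared global
    driver.get(url)
    ...  # same extraction logic as in the function above

pool = ThreadPool(3)
# results = pool.map(extract_info_per_review, urls_review_from_item)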
All the imports:
import random
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from multiprocessing.dummy import Pool as ThreadPool
import time
from webdriver_manager.chrome import ChromeDriverManager
from selectorlib import Extractor
import os
from datetime import date
import shutil
import json
import pandas as pd
from datetime import datetime
import csv
The last imports come from another task related to storing the information more efficiently; I should probably ask a separate question about that.
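(For reference, by dicts I mean storing each review keyed by the names from my commented-out header_reviews list, roughly like this:)

# Sketch of the dict-based storage I'm considering, with placeholder defaults.
review_record = {
    'Username': 'Not username',
    'Rating': 'Not rating',
    'Title': 'Not title',
    'Date': 'Not date',
    'Size': 'Not size',
    'Verified': 'Not verified',
    'Review': 'Not review',
    'Images': [],
    'Votes': 'Not votes',
}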
I'm stuck, so any recommendations are welcome. Thanks to all of you.