How to efficiently scrape many elements (reviews) on a single page with Selenium

Question

I have been trying to scrape all the reviews available at a single URL by repeatedly clicking the 'Show 6 more reviews' button. I believe this problem applies to anyone scraping many dynamic elements at a single URL with Selenium.

Problem: when the number of reviews exceeds a few hundred, the loop becomes very slow. I am using Selenium because the website relies on JavaScript.

HTML of the button I am clicking (towards the bottom of the page):

<button type="button" class="css-1e0935c" data-comp="Link Box">Show 6 more reviews<svg viewBox="0 0 95 57" class="css-1ymrwr7" data-comp="Chevron Box"><path d="M47.5 57L95 9.5 85.5 0l-38 38-38-38L0 9.5 47.5 57z"></path></svg></button>

Things I tried:

Not loading images: no improvement (not shown below).
Using the most efficient selector I could find in the loop.

Things I thought of:

Replacing Chrome with PhantomJS. I'm having issues scrolling with PhantomJS, and I didn't pursue it because the gains seem incremental rather than the orders of magnitude I need (I could be wrong).
Loading the reviews as they become available instead of after the whole page has been 'expanded'. I don't think this would solve the performance problem.
Finding a way to locate the button faster. I read about how browsers match CSS selectors but couldn't find a way to improve my code.

This is my first SO question. Thank you very much for your patience and help.

Reproducible code in Python 2 or 3 (the slow loop is at the bottom):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Page with 5591 reviews
url = "https://www.sephora.com/product/soy-face-cleanser-P7880?icid2=:p7880:product"

driver = webdriver.Chrome()
driver.get(url)
time.sleep(4)

# Navigation steps  (feel free to skip)
# scroll to section 'Similar Products' (above Reviews)
timeout = 10
wait_driver = WebDriverWait(driver, timeout)
section_title = wait_driver.until(EC.presence_of_element_located(
        (By.XPATH, '//h2[@class="css-1orm38z"]')))
driver.execute_script("arguments[0].scrollIntoView();", section_title)
# Sort by newest review
wait_driver.until(EC.presence_of_element_located(
        (By.XPATH, '//button[@class="css-u2mtre"]'))).click()
wait_driver.until(EC.presence_of_element_located(
        (By.XPATH, '//div/span[text()="Newest"]'))).click()

# This is the loop that is way too slow

# First expand all reviews by clicking button
numReviews = 0
while True:
    try:
        # Fastest selector I could come up with
        button = driver.find_element(By.CSS_SELECTOR, '.css-1e0935c')
        button.click()

        numReviews += 6
        print("Loading 6 more reviews... (" + str(numReviews) + ")")

    except Exception:
        break

# Now that full page is loaded, store all reviews
# [...]

Output:

Loading 6 more reviews... (6)
Loading 6 more reviews... (12)
Loading 6 more reviews... (18)
Loading 6 more reviews... (24)
Loading 6 more reviews... (30)
Loading 6 more reviews... (36)
Loading 6 more reviews... (42)
Loading 6 more reviews... (48)
Loading 6 more reviews... (54)
Loading 6 more reviews... (60)
Loading 6 more reviews... (66)
Loading 6 more reviews... (72)
Loading 6 more reviews... (78)
Loading 6 more reviews... (84)

etc... My program runs fine for a product with, say, 200 reviews, but the loop above takes more and more time as the number of reviews increases (e.g. >5000 in my sample URL).
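
To see where the time goes as the page grows, the lookup and the click can be timed separately on each iteration. The following is only a rough sketch: `timed_clicks` and `find_button` are hypothetical names, and `find_button` is assumed to be a zero-argument callable wrapping the `find_element` call from the loop above.

```python
import time


def timed_clicks(find_button, max_clicks=1000):
    """Click repeatedly and record (lookup_seconds, click_seconds) per click.

    find_button is a hypothetical zero-argument callable that returns an
    object with a click() method, or raises once the button is gone.
    """
    timings = []
    for _ in range(max_clicks):
        start = time.perf_counter()
        try:
            button = find_button()
            found = time.perf_counter()
            button.click()
        except Exception:
            # Same stop condition as the original loop: any failure ends it.
            break
        done = time.perf_counter()
        timings.append((found - start, done - found))
    return timings
```

With the script above, this could be called as `timed_clicks(lambda: driver.find_element(By.CSS_SELECTOR, '.css-1e0935c'))`. Comparing the first and last entries of the returned list would show whether the lookup or the click is the part that degrades.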

Which part is actually slow? Is it the page loading or your code scraping it? — SuperStew, Jan 25 '18 at 21:31
It is the page loading the additional reviews. That while loop is what takes forever (before the actual scraping) — quentin, Jan 25 '18 at 21:41
You could easily scrape the 5599 comments in a few seconds with an HTTP request executed directly in the page. Execute `performance.getEntries().find(e => e.name.includes("reviews.json")).name` in your console to get the URL. — Florent B., Jan 25 '18 at 22:34
Thanks (wow). I got a URL that has 'Limit=30', and if I try to increase it I get the error "Invalid limit value: 5000, as limit cannot be greater than 100" — quentin, Jan 25 '18 at 23:12
@FlorentB. Is there something you need to do before running that code for it to work? I'm getting, `Uncaught TypeError: Cannot read property 'name' of undefined at :1:68`. I tried running a profile first in the Performance tab but that didn't work either. I'm using Chrome. — JeffC, Jan 26 '18 at 00:58
@JeffC, you need to display the comments in page before running the code. It simply selects the Ajax request executed by the page which returned the reviews. — Florent B., Jan 26 '18 at 01:30
@FlorentB do you mind explaining briefly how you found the right console incantation? That would be tremendously helpful to me. — quentin, Jan 27 '18 at 21:58
@quentin, the JavaScript API is well documented, and to find the URL returning the data, simply monitor the requests via devtools. — Florent B., Jan 29 '18 at 10:50
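
The approach suggested in the comments above can be sketched roughly as follows: read the reviews request URL from the browser's performance entries, then build one URL per page of at most 100 reviews. This is a sketch under assumptions, not a tested scraper. The `Limit` parameter comes from the comments, but the `Offset` parameter name is an assumption, and `find_reviews_url`, `with_paging` and `page_urls` are hypothetical helper names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The console snippet from the comments, adapted to return null instead of
# throwing when no matching request has been recorded yet.
FIND_URL_JS = (
    "const hit = performance.getEntries()"
    ".find(e => e.name.includes(arguments[0]));"
    "return hit ? hit.name : null;"
)


def find_reviews_url(driver, needle="reviews.json"):
    """Return the first recorded request URL containing needle, or None.

    The reviews must already be displayed on the page, otherwise the
    request has not been made and nothing is found.
    """
    return driver.execute_script(FIND_URL_JS, needle)


def with_paging(url, limit, offset):
    """Return url with its Limit and Offset query parameters replaced.

    'Offset' is an assumed parameter name; check it against the real URL.
    """
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("Limit", "Offset")
    ]
    query += [("Limit", str(limit)), ("Offset", str(offset))]
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(url, total, limit=100):
    """List one URL per page of at most limit reviews, covering total."""
    return [with_paging(url, limit, offset) for offset in range(0, total, limit)]
```

For the 5591 reviews in the question, `page_urls(find_reviews_url(driver), 5591)` would produce 56 URLs, each of which could then be fetched and parsed as JSON. The response structure is not shown in the thread, so that step is left out.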
