
So I am scraping reviews and skin types from Sephora and have run into a problem identifying how to get elements off of the page. Sephora.com loads reviews dynamically as you scroll down the page, so I have switched from BeautifulSoup to Selenium to get the reviews.

The reviews have no ID, no name, and no CSS identifier that seems to be stable. The XPath isn't recognized each time I try to use it, whether I copy it from Chrome or from Firefox.

Here is an example of the HTML from the element I inspected in Chrome: (screenshot of the Inspect Element view omitted)

My Attempts thus far:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("/Users/myName/Downloads/chromedriver")
url = 'https://www.sephora.com/product/the-porefessional-face-primer-P264900'
driver.get(url)
reviews = driver.find_elements_by_xpath(
    "//div[@id='ratings-reviews']//div[@data-comp='Ellipsis Box ']")

print("REVIEWS:", reviews)

Output:

| => /Users/myName/anaconda3/bin/python "/Users/myName/Documents/ScrapeyFile Group/attempt32.py"
REVIEWS: []
(base) 

So basically an empty list.

ATTEMPT 2:

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Open up a Firefox browser and navigate to web page.
driver = webdriver.Firefox()
driver.get(
    "https://www.sephora.com/product/squalane-antioxidant-cleansing-oil-P416560?skuId=2051902&om_mmc=ppc-GG_1165716902_56760225087_pla-420378096665_2051902_257731959107_9061275_c&country_switch=us&lang=en&ds_rl=1261471&gclid=EAIaIQobChMIisW0iLbK6AIVaR6tBh005wUTEAYYBCABEgJVdvD_BwE&gclsrc=aw.ds"
)

# Scroll to the bottom of the page because it loads content dynamically
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)

#scrape stats and comments
comments = driver.find_elements_by_css_selector("div.css-7rv8g1")

print("!!!!!!Comments!!!!!")
print(comments)

OUTPUT:

| => /Users/MYNAME/anaconda3/bin/python /Users/MYNAME/Downloads/attempt33.py
!!!!!!Comments!!!!!
[]
(base)

Empty again. :(
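A single END keypress may not fire every lazy-load, so I also considered scrolling in a loop until the page height stops growing. A sketch of the idea (the pause length is a guess):

```python
import time

def scroll_to_end(driver, pause=2.0, max_rounds=20):
    # Keep scrolling to the bottom until the document height stops changing.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height
```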

I get the same results when I try to use different element selectors:

#scrape stats and comments
comments = driver.find_elements_by_class_name("css-7rv8g1")

I also get nothing when I tried this:

comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box ']")

and this (notice the space after "Ellipsis Box" is gone):

comments = driver.find_elements_by_xpath(
    "//div[@data-comp='GridCell Box']//div[@data-comp='Ellipsis Box']")

I have tried using the solutions outlined here and here but to no avail -- I think there is something I don't understand about the page or Selenium that I am missing, since this is my first time using Selenium, so I'm a super newbie :(

2 Answers

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome(executable_path=r"")
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.get("https://www.sephora.fr/p/black-ink---classic-line-felt-liner---eyeliner-feutre-precis-waterproof-P3622017.html")
# Scroll to the bottom twice to trigger the lazy-loaded reviews
for _ in range(2):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)

reviewText=wait.until(EC.presence_of_all_elements_located((By.XPATH, "//ol[@class='bv-content-list bv-content-list-reviews']//li//div[@class='bv-content-summary-body']//div[1]")))
for textreview in reviewText:
    print(textreview.text)

Output:

(screenshot of the printed review text omitted)

SeleniumUser
  • so when I did this I got: `Traceback (most recent call last): File "/Users/myName/Downloads/tests.py", line 30, in "//ol[@class='bv-content-list bv-content-list-reviews']//li//div[@class='bv-content-summary-body']//div[1]" File "/Users/myName/anaconda3/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 80, in until raise TimeoutException(message, screen, stacktrace) selenium.common.exceptions.TimeoutException: Message: ` It seems the format for the french website is 100% different making it a bit harder :( – apples oranges Apr 10 '20 at 17:51
  • @applesoranges: Please try the working solution, and once you verify it, would you be kind enough to accept the answer and upvote it? – SeleniumUser Apr 12 '20 at 00:15

I've been scraping reviews from Sephora, and even if there is plenty of room for improvement, it basically works like this:

1. Click on "reviews" to access the reviews
2. Load all reviews by scrolling until there are no reviews left to load
3. Find the review text and skin type by CSS selector


def load_all_reviews(driver):
    while True:
        try:
            driver.execute_script(
                "arguments[0].scrollIntoView(true);",
                WebDriverWait(driver, 10).until(
                    EC.visibility_of_element_located(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
  
            driver.execute_script(
                "arguments[0].click();",
                WebDriverWait(driver, 20).until(
                    EC.element_to_be_clickable(
                        (By.CSS_SELECTOR, ".bv-content-btn-pages-load-more")
                    )
                ),
            )
        except Exception:  # no "load more" button left, so all reviews are loaded
            break


def get_review_text(review):
    try:
        return review.find_element(By.CLASS_NAME, "bv-content-summary-body-text").text
    except Exception:
        return "NA"  # in case it doesn't find a review

def get_skin_type(review):
    try:
        return review.find_element(By.XPATH, '//*[@id="BVRRContainer"]/div/div/div/div/ol/li[2]/div[1]/div/div[2]/div[5]/ul/li[4]/span[2]').text
    except Exception:
        return "NA"  # in case it doesn't find a skin type

To use those, you've got to create a webdriver and first call the load_all_reviews() function. Then you've got to find the reviews with:

reviews = driver.find_elements(By.CSS_SELECTOR, ".bv-content-review")

and finally, for each review, you can call the get_review_text() and get_skin_type() functions:

for review in reviews:
    print(get_review_text(review))
    print(get_skin_type(review))
m0r