I am scraping a review page using Selenium in Python. I want to extract the rating of each review (i.e., extract 7 from 7/10 in a review). The HTML is structured like this:

    <div class="review">
      <div class="rating-bar">
        <span class="user-rating">
          <svg class="ipl-icon ipl-star-icon" xmlns="http://www.w3.org/2000/svg" fill="#000000" height="24" viewBox="0 0 24 24" width="24">
            <path d="M0 0h24v24H0z" fill="none"></path>
            <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"></path>
            <path d="M0 0h24v24H0z" fill="none"></path>
          </svg>
          <span>7</span>  <!-- What I want to extract -->
          <span class='scale'>/10</span>
        </span>
      </div>
    </div>

The element I want has no class name of its own, so I assume I should locate it through the parent span with the class user-rating:

    rating = driver.find_elements_by_class_name('user-rating')

But how do I extract a span nested inside another span when I cannot refer to it by a class name?

In addition, not every review contains a rating, so when the scraper reaches a review without one, it raises this error:

    NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".rating-other-user-rating"} (Session info: chrome=87.0.4280.66)

This is what I have tried out so far:

    review = driver.find_elements_by_class_name("review")
    rating_ls = []
    
    for i in review:
        rating = i.find_element_by_class_name('rating-other-user-rating').text
        # If rating exists, append it to the list, otherwise append "N/A" 
        rating_ls.append(rating[0] if rating else "N/A")   

I would appreciate it if anyone could help me with this. Thanks a lot in advance!

cwyjm

2 Answers

Try waiting for the elements (they are probably added by JavaScript):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 seconds until the review containers are present in the DOM
    reviews = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "review-container")))

    for review in reviews:
        # find_elements returns an empty list (instead of raising) when nothing matches
        _rating = review.find_elements(By.CLASS_NAME, 'rating-other-user-rating')
        rating = _rating[0].text if _rating else 'N/A'
        _comment = review.find_elements(By.CLASS_NAME, 'content')
        comment = _comment[0].text if _comment else 'N/A'
        print(rating + ": " + comment)
DonnyFlaw
  • I tried out your code, but it seems to skip the reviews without ratings. Can I clarify whether the line `review = driver.find_elements_by_class_name("review")` and the `for` loop should still be used? – cwyjm Nov 25 '20 at 10:40
  • @cwyjm Does the class name for reviews without a rating differ from that of reviews with a rating? – DonnyFlaw Nov 25 '20 at 11:23
  • No, it doesn't. They differ only in whether a rating exists. – cwyjm Nov 25 '20 at 12:38
  • @cwyjm Check the updated answer. Is that the output you want? – DonnyFlaw Nov 25 '20 at 12:54
  • Yes, it is! But why should I apply the wait to the reviews? I thought I should check whether the rating exists in each review, given that it exists on the page. – cwyjm Nov 25 '20 at 13:00
  • @cwyjm Yes, you can replace `'review-container'` with `'rating-other-user-rating'` and wait for the list of ratings, but in that case you won't get any `"N/A"`: if a rating is not found, the review is skipped. So it depends on the exact output you want. – DonnyFlaw Nov 25 '20 at 13:09
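
The pattern discussed in the comments above (look up the rating inside each review and fall back to "N/A" when it is missing) can be sketched without a browser. This is only a minimal illustration under a few assumptions: `rating_or_na`, `FakeElement` and `FakeReview` are hypothetical names, the fakes stand in for Selenium `WebElement` objects, and the rating element's text is assumed to look like `7/10`. Splitting on the slash, rather than taking the first character, also keeps a score such as `10/10` intact.

```python
CSS_SELECTOR = "css selector"  # same string value as Selenium's By.CSS_SELECTOR


def rating_or_na(review, selector=".rating-other-user-rating"):
    """Return the score before the slash (e.g. "7" from "7/10"), or "N/A"."""
    # find_elements (plural) returns an empty list when nothing matches,
    # so no NoSuchElementException is raised for reviews without a rating.
    matches = review.find_elements(CSS_SELECTOR, selector)
    if not matches:
        return "N/A"
    return matches[0].text.split("/")[0].strip()


# Hypothetical stand-ins so the sketch runs without a browser.
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeReview:
    def __init__(self, rating_texts):
        self._ratings = [FakeElement(t) for t in rating_texts]

    def find_elements(self, by, value):
        # A real WebElement would search its subtree; the fake just returns
        # whatever rating elements it was constructed with.
        return list(self._ratings)


reviews = [FakeReview(["7/10"]), FakeReview([]), FakeReview(["10/10"])]
print([rating_or_na(r) for r in reviews])
```

With a real driver, the same helper would presumably be called on each element of the waited-for review list instead of the fake objects.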

To extract the rating of each review (i.e., extract 7 from 7/10 in a review) using Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using XPATH, span index and text attribute:

    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='review']//span[@class='user-rating']/span[1]")))])
    
  • Using XPATH, attribute and get_attribute():

    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='review']//span[@class='user-rating']/span[not(contains(@class,'scale'))]")))])
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
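
One way to sanity-check a locator like the ones above before running it in a browser is to try it against a static fragment. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so it is a rough check rather than a substitute for the Selenium call. The fragment is a simplified, hypothetical copy of the markup from the question: the `svg` icon is omitted and the second review, which has no rating, is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page markup (assumption, not the real page).
SAMPLE = """
<div>
  <div class="review">
    <div class="rating-bar">
      <span class="user-rating">
        <span>7</span>
        <span class="scale">/10</span>
      </span>
    </div>
  </div>
  <div class="review">
    <p>A review without a rating</p>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE.strip())

ratings = []
for review in root.findall(".//div[@class='review']"):
    # First span child of the user-rating span, i.e. the bare number.
    node = review.find(".//span[@class='user-rating']/span[1]")
    ratings.append(node.text.strip() if node is not None else "N/A")

print(ratings)  # ['7', 'N/A']
```

Searching relative to each review, rather than across the whole document, is what keeps one result per review and makes the "N/A" fallback possible.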
    

undetected Selenium