
I am currently building a Google Play Store scraper that scrapes reviews from a particular app and writes them to a file. To do that, I used Python Selenium to load all the reviews here: https://play.google.com/store/apps/details?id=com.grabtaxi.passenger&showAllReviews=true, and then extracted them.

All the reviews sit inside a div with the specific class zc7KVe, so the XPath I identified for obtaining all of them is: //div[@class='zc7KVe'].

This is the line of Python code used to find those elements with the above XPath, run inside a while loop: reviews = driver.find_elements(By.XPATH, "//div[@class='zc7KVe']")

The problem is that as I keep scrolling further down the page, the reviews variable gets larger and larger, because the above XPath re-finds every element that satisfies the condition. This causes the time taken for the scraping operation to increase dramatically (e.g. after scrolling down the page 80 times, it took over 20 minutes to scrape 240 new reviews, compared to 30 seconds when I first started).

To make it faster, I am exploring including position() in my XPath so that I do not need to extract all the elements that satisfy the condition every time. I tried testing an XPath like //div[contains(@class,'zc7KVe') and (position() >= 100) and not(position() > 200)] in Chrome DevTools, but to no avail.

Is there an XPath that can satisfy searching by the specific class and also the range?

EDIT

When inspecting in DevTools, the structure of the HTML would look like this:

<div jscontroller="..." jsmodel="..." jsdata="..." ...>
    <div class="zc7KVe">
        <!-- one review -->
    </div>
</div>
<div jscontroller="..." jsmodel="..." jsdata="..." ...>
    <div class="zc7KVe">
        <!-- one review -->
    </div>
</div>
<!-- and so on -->
Teo Wei Shen

1 Answer


There are multiple ways to improve performance here:

  • first scroll down until you have loaded all the reviews (or a certain number of them) and only then extract them
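
    A minimal sketch of that scroll-first idea (assuming a Selenium-style `driver` exposing `execute_script`; the pause length and scroll cap are illustrative, not tuned values):

    ```python
    import time

    def scroll_to_load_reviews(driver, pause=1.5, max_scrolls=100):
        """Scroll to the bottom repeatedly until the page height stops
        growing (no more reviews are being lazy-loaded) or max_scrolls
        is reached.  Returns the number of scrolls performed."""
        last_height = driver.execute_script("return document.body.scrollHeight")
        scrolls = 0
        for _ in range(max_scrolls):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            scrolls += 1
            time.sleep(pause)  # give newly loaded reviews time to render
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # page stopped growing; assume everything is loaded
            last_height = new_height
        return scrolls
    ```

    Only after this loop finishes would you run the single `find_elements` call, so the expensive element search happens once instead of on every scroll.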
  • let HTML parsers do the HTML parsing, which cuts down on the number of JSON-over-HTTP Selenium commands and the other overhead of locating elements via the webdriver. You can grab the inner/outer HTML of the review section and parse it with, for example, BeautifulSoup. Something along these lines:

    In [8]: reviews = driver.find_element(By.XPATH, "//h3[. = 'User reviews']/following-sibling::div[1]")
    
    In [9]: soup = BeautifulSoup(reviews.get_attribute("outerHTML"), "lxml")
    
    In [10]: for review in soup.div.find_all("div", jscontroller=True, recursive=False):
                 author = review.find(class_="X43Kjb").get_text()
                 print(author)   
    Angie Lin
    Danai Sae-Han
    Siwon's Armpit Hair
    Vishal Mehta
    Ann Leong
    V. HD
    Mark Stephen Masilungan 
    ...
    Putra Pandu Adikara
    kei tho
    Phụng Nguyễn
    
  • remember the last element you got a review from and use the following-sibling axis to extract only the siblings that come after it
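
    One way to sketch that "cursor" approach (the function name and the locator are illustrative; `find_elements("xpath", …)` is the Selenium 4 locator-string form, and any object exposing that method works here):

    ```python
    def new_reviews_after(last_container):
        """Given the last review container already processed, return only
        the sibling containers that come after it, so each pass inspects
        just the newly loaded part of the page instead of every review."""
        return last_container.find_elements(
            "xpath", "following-sibling::div[div[@class='zc7KVe']]"
        )
    ```

    After each scroll you would call this on the last container you saw, process the returned elements, and advance the cursor to the final one.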
  • you may also look into the Google Play API and official or unofficial clients (like this one), which may help you look at the problem from a different angle
  • and, if you are still up for an XPath approach and want to use position() to filter by a "range", you can operate within the scope of the container holding the reviews:

    (//div[@jsmodel = 'y8Aajc'])[position() >= 10 and position() <= 20]
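
    Note that the parentheses matter: without them, position() is evaluated per parent element rather than over the whole result set. A small helper to build such a range expression (the jsmodel value is taken from the snippet above and may differ on the live page):

    ```python
    def range_xpath(base_expr, start, stop):
        """Build an XPath selecting matches number start..stop (1-based,
        inclusive) of base_expr.  The outer parentheses make position()
        count over the whole result set instead of per parent element."""
        return f"({base_expr})[position() >= {start} and position() <= {stop}]"

    # Reviews 10 through 20 by the container's jsmodel attribute:
    xp = range_xpath("//div[@jsmodel='y8Aajc']", 10, 20)
    ```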
    
alecxe