I am building a Google Play Store scraper that collects the reviews for a particular app and writes them to a file. To do that, I use Python with Selenium to load all the reviews here: https://play.google.com/store/apps/details?id=com.grabtaxi.passenger&showAllReviews=true, and then extract them.
All the reviews are inside elements with the class zc7KVe, so the XPath I use to obtain them is:
//div[@class='zc7KVe']
This is the line of Python I use to find those elements with the above XPath; it runs inside a while loop:
reviews = driver.find_elements(By.XPATH, "//div[@class='zc7KVe']")
The problem is that as I keep scrolling down the page, the reviews list gets larger and larger, because the above XPath matches every element that satisfies the condition, including the ones I have already processed. As a result, each scraping pass takes longer and longer (e.g. after scrolling down the page 80 times, it took over 20 minutes to scrape 240 new sets of reviews, compared with 30 seconds when I first started).
To make it faster, I am exploring using position() in my XPath so that I do not have to fetch every element that satisfies the condition. I have read up on this and tested an XPath like the following in Chrome DevTools, but to no avail:
//div[contains(@class,'zc7KVe') and (position() >= 100) and not (position() > 200)]
Is there an XPath that can satisfy searching by the specific class and also the range?
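For reference, here is a minimal sketch of the kind of range expression I am after. In XPath 1.0, a predicate attached directly to //div counts position() among the siblings under each parent, whereas wrapping the whole path in parentheses makes position() count over the full result set in document order. The helper name build_range_xpath is hypothetical, and I have not verified that this is actually faster, since the browser may still have to walk every matching node before applying the range.

```python
def build_range_xpath(class_name: str, start: int, end: int) -> str:
    """Build an XPath for the start..end-th (1-based, inclusive) div with this class.

    The parentheses around the inner path matter: they make position()
    refer to the index in the whole match list, not the index among siblings.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return (
        f"(//div[@class='{class_name}'])"
        f"[position() >= {start} and position() <= {end}]"
    )


xpath = build_range_xpath("zc7KVe", 100, 200)
print(xpath)

# Assumed usage with an existing Selenium driver (not run here):
# reviews = driver.find_elements(By.XPATH, xpath)
```

The same string can be pasted into Chrome DevTools with $x("...") to check what it returns.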
Addendum
When inspecting the page in DevTools, the HTML structure looks like this:
<div jscontroller="..." jsmodel="..." jsdata="..." ...>
<div class="zc7KVe">
<!-- One review -->
<div jscontroller="..." jsmodel="..." jsdata="..." ...>
<div class="zc7KVe">
<!-- One review -->
<!-- and so on -->
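With this structure, each review div is the only zc7KVe element inside its wrapper, which may be why a positional predicate on //div never gets past 1. The sketch below tries to reproduce that using Python's standard xml.etree.ElementTree as a stand-in: it supports only a small subset of XPath and is not the browser's engine, and the three-review markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up version of the page: one wrapper div per review.
markup = """
<root>
  <div><div class="zc7KVe">review 1</div></div>
  <div><div class="zc7KVe">review 2</div></div>
  <div><div class="zc7KVe">review 3</div></div>
</root>
"""
root = ET.fromstring(markup)

# A positional predicate is evaluated per parent, so [1] matches every review
# (each is the first div in its wrapper) and [2] matches none of them.
first_in_parent = root.findall(".//div[@class='zc7KVe'][1]")
second_in_parent = root.findall(".//div[@class='zc7KVe'][2]")

# Indexing over the whole result set has to happen outside that predicate,
# here with an ordinary Python slice (reviews 2 and 3).
all_reviews = root.findall(".//div[@class='zc7KVe']")
window = [review.text for review in all_reviews[1:3]]

print(len(first_in_parent), len(second_in_parent), window)
```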