
I am using Selenium within R.

I have the following script which searches Google Maps for all pizza restaurants around a given geographical coordinate - and then keeps scrolling until all restaurants are loaded.

First, I navigate to the starting page:

library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())

remDr<- remote_driver$client

lat <- 40.7484
lon <- -73.9857

# Create the URL using the paste function
URL <- paste0("https://www.google.com/maps/search/pizza/@", lat, ",", lon, ",17z/data=!3m1!4b1!4m6!2m5!3m4!2s", lat, ",", lon, "!4m2!1d", lon, "!2d", lat, "?entry=ttu")

# Navigate to the URL
remDr$navigate(URL)

Then, I use the following code to keep scrolling until all entries have been loaded:

# Get the initial list of result cards (this is where waiting for the elements to load would belong)
elements <- remDr$findElements(using = "css selector", "div.qjESne")

while (TRUE) {
    new_elements <- remDr$findElements(using = "css selector", "div.qjESne")

    # Pick the last element in the list - this is the one we want to scroll to
    last_element <- elements[[length(elements)]]
    # Scroll to the last element
    remDr$executeScript("arguments[0].scrollIntoView(true);", list(last_element))
    Sys.sleep(10)

    # Update the elements list
    elements <- new_elements

    # Check if there are any new elements loaded - the "You've reached the end of the list." message
    if (length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) {
        print("No more elements")
        break
    }
}

Finally, I use this code to extract the names and addresses of all restaurants:

titles <- c()
addresses <- c()

# Once the "You've reached the end of the list." message is present, all results are loaded
if (length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) {
    # now we can parse the data since all the elements loaded
    for (data in remDr$findElements(using = "css selector", "div.lI9IFe")) {
        title <- data$findElement(using = "css selector", "div.qBF1Pd.fontHeadlineSmall")$getElementText()[[1]]
        restaurant <- data$findElement(using = "css selector", ".W4Efsd > span:nth-of-type(2)")$getElementText()[[1]]

        titles <- c(titles, title)
        addresses <- c(addresses, restaurant)
    }

    # This converts the list of titles and addresses into a dataframe
    df <- data.frame(title = titles, address = addresses)
    print(df)
}

Instead of using Sys.sleep() in R, I am trying to change my code so that it only scrolls (i.e., performs the next action) once the previous action has completed. My existing code often freezes halfway through, and I suspect this is because I am trying to load new content while the existing page has not fully loaded. It seems better to somehow delay the action and wait for the page to be fully loaded before proceeding.

How can I delay my script and force it to wait for the existing page to load before loading a new page? (e.g., R - Waiting for page to load in RSelenium with PhantomJS)
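For context, the kind of pattern I am after is an "explicit wait": instead of sleeping for a fixed 10 seconds, poll a condition in short intervals and move on as soon as it holds. A rough illustration of the idea (not code I already have working):

# Rough idea: poll every half second, for at most 10 seconds,
# until the end-of-list message (or any other condition) is present
for (i in 1:20) {
    if (length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) break
    Sys.sleep(0.5)
}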

Note: I am also open to a Python solution.

  • Can you not use the wait function from selenium? – Hermann12 Aug 13 '23 at 06:42
  • @Hermann12: thank you for your reply! Do you think you can please show me how to use this function here? – stats_noob Aug 13 '23 at 06:57
  • You can still find a lot of examples [here](https://stackoverflow.com/questions/5868439/wait-for-page-load-in-selenium) – Hermann12 Aug 13 '23 at 08:49
  • Thanks! I saw this link before - i am trying to learn: how to modify these examples for the R programming language – stats_noob Aug 13 '23 at 12:53
  • Explicit waits seem like it will help - I don't know about R, but there are examples for Python: [explicit-waits](https://www.selenium.dev/documentation/webdriver/waits/#explicit-waits) – user7434398 Aug 15 '23 at 12:28
  • you can use the waitUntil function to wait for a specific condition to be met before proceeding. – Rai Hassan Aug 15 '23 at 18:54
  • 1
    The bounty attracted at least one [ChatGPT](https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned) plagiariser. – Peter Mortensen Aug 18 '23 at 21:23
  • You seem to be scraping data from google. Did you try looking at Google places API instead? – Salman A Sep 01 '23 at 07:48

1 Answer

library(RSelenium)
library(wdman)
library(netstat)

# Initialize Selenium driver
driver <- rsDriver(browser = "chrome", verbose = FALSE, port = free_port())
remDr <- driver$client

lat <- 40.7484
lon <- -73.9857

# Create the URL
URL <- paste0("https://www.google.com/maps/search/pizza/@", lat, ",", lon, ",17z/data=!3m1!4b1!4m6!2m5!3m4!2s", lat, ",", lon, "!4m2!1d", lon, "!2d", lat, "?entry=ttu")

# Navigate to the URL
remDr$navigate(URL)

# RSelenium has no built-in explicit wait, so define a small helper that
# polls a condition function until it returns TRUE or the timeout expires
wait_for <- function(condition, timeout = 10, interval = 0.25) {
    deadline <- Sys.time() + timeout
    while (Sys.time() < deadline) {
        ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
        if (ok) return(invisible(TRUE))
        Sys.sleep(interval)
    }
    warning("Condition not met within the timeout")
    invisible(FALSE)
}

# Wait until the page is fully loaded
wait_for(function() {
    remDr$executeScript("return document.readyState;")[[1]] == "complete"
}, timeout = 10)

# Your scrolling and data extraction code here

# Close the driver
remDr$close()

The wait_for() helper polls the supplied condition, in this case that document.readyState has become 'complete' (indicating the page has finished loading), and returns as soon as the condition is TRUE. Once that happens, you can proceed with your scrolling and data extraction code.

Waiting on an explicit condition like this is more reliable than a fixed Sys.sleep() because the script continues as soon as the page is actually ready for interaction, and keeps waiting (up to the timeout) when loading takes longer.
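To fold the same idea into your scrolling loop (as asked in the comments), you can replace the fixed Sys.sleep(10) after each scroll with a wait_for() call whose condition is "more result cards have loaded, or the end-of-list message is present". A sketch along those lines, reusing the selectors from your question (the timeout values are arbitrary placeholders):

# Get the currently loaded result cards
cards <- remDr$findElements(using = "css selector", "div.qjESne")

while (length(cards) > 0) {
    old_count <- length(cards)

    # Scroll the last currently loaded card into view
    remDr$executeScript("arguments[0].scrollIntoView(true);", list(cards[[old_count]]))

    # Wait for new cards or the end-of-list message instead of Sys.sleep(10)
    loaded <- wait_for(function() {
        length(remDr$findElements(using = "css selector", "div.qjESne")) > old_count ||
            length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0
    }, timeout = 15)

    if (!loaded || length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) {
        print("No more elements")
        break
    }

    # Refresh the list of cards before the next scroll
    cards <- remDr$findElements(using = "css selector", "div.qjESne")
}

This still sleeps, but only in short polling intervals, and it moves on as soon as the condition is satisfied.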

Rai Hassan
  • @Raj Hassan: thank you for your answer! Can you please explain what is the 10 doing in your code? – stats_noob Aug 15 '23 at 22:41
  • Thanks for noticing, I missed the timeout attribute here. The wait helper takes two main arguments: the timeout duration and a condition function. The timeout duration specifies how many seconds RSelenium will wait for the condition to be met before timing out and continuing the script. – Rai Hassan Aug 16 '23 at 09:50
  • @Raj Hassan: thank you for your reply! I am still trying to understand this: if you wait 10 seconds for a timeout but the page loads in 4 seconds - this means you saved 6 seconds. But if the page takes more than 10 seconds to load, you skip to the next action... correct? – stats_noob Aug 16 '23 at 13:15
  • Let me explain for you. If the condition is met within the timeout duration (e.g., the page loads in 4 seconds but you set a timeout of 10 seconds): RSelenium will not wait for the full 10 seconds; it will proceed as soon as the condition is met. If the condition is not met within the timeout duration (e.g., the page takes longer than 10 seconds to load): RSelenium will wait for the full timeout duration of 10 seconds. After the timeout, if the condition is still not met, RSelenium will raise an error or proceed with the next action. – Rai Hassan Aug 16 '23 at 13:18
  • @ Raj Hassan: thank you so much for your explanation! Do you think if you have time, can you please take my full code and show how you can insert your logic into my full code? Thank you so much! – stats_noob Aug 17 '23 at 01:53