
I have the following code to try and scrape a website:

    library(httr)
    library(rvest)
    library(magrittr)

    url <- "https://www.fotocasa.es/es/alquiler/casas/madrid-capital/todas-las-zonas/l"

    x <- GET(url)

    x %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="App"]/div[2]/div[1]/main/div/div[4]')

What I would like to do is collect the page numbers at the bottom of the page. Previously, `html_nodes(".sui-PaginationBasic-item a")` worked, but it no longer does, so I tried the XPath above, taken from the browser's inspect element.

The output would be something like:

c(1, 2, 3, 4, 5, ..., 101)

depending on how many pages there are for the given region.

user113156

  • Hovering over the pagination buttons reveals a consistent URL pattern, www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/`n`, where `n` is the page number, currently running from 1 to 200. Harvest with your method of choice. Mind that the server might just block you from firing automated requests. –  Apr 08 '22 at 20:35
  • Thanks! I just need to determine what the max `n` is - For each region the max `n` will be different. – user113156 Apr 08 '22 at 20:47
  • Are you getting a captcha response? – QHarr Apr 08 '22 at 21:07
  • No, I can access the website from the browser without problem. – user113156 Apr 08 '22 at 21:08
  • Unfortunately I know too little about the details but from this post it seems you might have to take the extra step to simulate visiting with a webbrowser using {RSelenium}: https://stackoverflow.com/a/51541800/18309711 –  Apr 08 '22 at 21:32
  • 1
    Just keep incrementing the page until you generate an error. – Dave2e Apr 09 '22 at 00:55
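Dave2e's suggestion can be sketched directly: build each page URL from the pattern noted in the comments and probe successive pages. The `page_url` helper below is hypothetical, and the (not-run) loop assumes an out-of-range page returns a non-200 status, which may not hold here; as discussed above, the site may instead redirect or serve a captcha.

```r
# hypothetical helper that builds a page URL following the pattern
# noted in the comments: <base>/<n>
page_url <- function(base, n) paste0(base, "/", n)

base <- "https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l"
page_url(base, 2)

# one way to probe for the last page (not run here; assumes the server
# returns a non-200 status for out-of-range pages, which it may not):
# n <- 1
# while (httr::status_code(httr::GET(page_url(base, n + 1))) == 200) n <- n + 1
```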

1 Answer


Using RSelenium, we can get the page numbers from the pagination as follows:

    library(stringr)
    library(RSelenium)
    library(rvest)
    library(dplyr)

    # launch the browser
    driver <- rsDriver(browser = "firefox")
    remDr <- driver[["client"]]
    url <- "https://www.fotocasa.es/es/alquiler/casas/madrid-capital/todas-las-zonas/l"
    remDr$navigate(url)

    # accept the cookie banner
    remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()

    # scroll to the end of the page
    webElem <- remDr$findElement("css", "html")
    webElem$sendKeysToElement(list(key = "end"))

    # press the up arrow to bring the pagination into view
    webElem$sendKeysToElement(list(key = "up_arrow"))

    # get the link targets from the pagination bar
    link <- remDr$getPageSource()[[1]] %>%
      read_html() %>%
      html_nodes('.re-Pagination') %>%
      html_nodes('a') %>%
      html_attr('href')

    # extract only the page numbers from the urls
    str_extract(link, "[[:digit:]]+")
    #> [1] NA    "2"   "3"   "4"   "200" "2"
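Since the question ultimately asks for the maximum `n`, the extracted strings can be reduced to a single number. A minimal sketch using the sample output above (the `NA` is the current-page link, which carries no digits):

```r
# the strings extracted by str_extract in the answer above
nums <- c(NA, "2", "3", "4", "200", "2")

# coerce to numeric and take the largest value, ignoring the NA
max_page <- max(as.numeric(nums), na.rm = TRUE)
max_page
# → 200
```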
Nad Pat