
I am web scraping this website and I am having trouble when I try to rbind all the columns into one dataset. It gives me an error because the columns have different numbers of rows, for example 25 elements in price and 24 in description.
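The mismatch alone is enough to reproduce the error; for example, with placeholder values (not real scraped data):

price <- rep("€ 100.000", 25)             # 25 prices found
description <- rep("Trilocale ...", 24)   # but only 24 descriptions
data.frame(price, description)
#> Error in data.frame(price, description) : 
#>   arguments imply differing number of rows: 25, 24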

{if(length(.) == 0) NA else .}

I added the piece of code above to insert NAs when the scraper doesn't find a value, but it doesn't seem to work. I leave the full code below.

library(rvest)
library(dplyr)

urls <- sprintf("https://www.immobiliare.it/vendita-case/milano/?pag=%d", 1:7)

case <- data.frame() 

for (i in urls){
  page <- read_html(i)
  
  Scrape <- page %>% html_nodes(xpath= "//ul[@class='nd-list in-realEstateResults']") %>% 
    purrr::map_df(~list(description= html_nodes(.x, xpath= "//a[@class='in-card__title']") %>% html_text() %>% {if(length(.) == 0) NA else .}, # returns NA when nothing is found
                        
                        price= html_nodes(.x, xpath= "//li[@class='nd-list__item in-feat__item in-feat__item--main in-realEstateListCard__features--main']") %>% html_text(trim = TRUE) %>% {if(length(.) == 0) NA else .},
                        
                        rooms = html_nodes(.x,xpath= "//li[@aria-label='locali']") %>% html_text(trim = TRUE) %>% {if(length(.) == 0) NA else .},
                        
                        area= html_nodes(.x,xpath= "//li[@aria-label='superficie']") %>% html_text(trim = TRUE) %>% {if(length(.) == 0) NA else .}))
  
  
  temp <- data.frame(Scrape)
  case <- rbind(temp, case)
  
  print(paste("Page:",i))
}

Any suggestions? Let me know if you have any questions.

Gio 255
  • Find out for which iteration of the loop your code is failing. Look at the values of `description`, `price`, `rooms` and `area` for that iteration. (You'll need to take the calls to `html_nodes` out of the call to `map_df` and assign them to temporary variables; a sketch of this check appears after these comments.) If you can't figure out what's going on, post the relevant data using `dput()` in your question. It may well be that you have made an invalid assumption about the structure of the website. – Limey Jul 04 '22 at 15:56
  • Thank you for the reply, but when I scrape every single element on its own the scrape works; it's just that sometimes it doesn't recognize elements on the page, and I don't know why. On its own, price has 270 rows and description has 273. I tried scraping them separately, but it doesn't make sense to bind them afterwards because the row order gets messed up. – Gio 255 Jul 04 '22 at 16:37
  • See this question https://stackoverflow.com/questions/56673908/how-do-you-scrape-items-together-so-you-dont-lose-the-index/56675147#56675147 – Dave2e Jul 04 '22 at 16:47
  • @Dave2e it doesn't help me, sorry. – Gio 255 Jul 04 '22 at 17:14
  • That is precisely my point. You need to figure out where the mismatch between price and the other item(s) occurs. (Perhaps there are some properties that are "price on application"? How are they displayed?) Then you need to figure out how to fix it. Web scraping will always be a fragile business because it relies totally on something outside your control: the structure of a remote web page. Since you are reluctant to provide the information we _need_ to help you, I am voting to close for lack of debugging detail. – Limey Jul 04 '22 at 17:49
  • I am asking how the code can insert NA when it doesn't recognize the XPath I gave it; please don't close the thread. – Gio 255 Jul 04 '22 at 19:48
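A sketch of the check Limey describes above: pull the `html_nodes` calls out of `map_df` for a single page and compare the lengths (the `tmp_*` variable names are just placeholders):

library(rvest)
library(dplyr)

page <- read_html("https://www.immobiliare.it/vendita-case/milano/?pag=1")

tmp_description <- page %>% html_nodes(xpath = "//a[@class='in-card__title']") %>% html_text()
tmp_price       <- page %>% html_nodes(xpath = "//li[@class='nd-list__item in-feat__item in-feat__item--main in-realEstateListCard__features--main']") %>% html_text(trim = TRUE)
tmp_rooms       <- page %>% html_nodes(xpath = "//li[@aria-label='locali']") %>% html_text(trim = TRUE)
tmp_area        <- page %>% html_nodes(xpath = "//li[@aria-label='superficie']") %>% html_text(trim = TRUE)

# If these are not all equal, at least one listing is missing one of the fields
lengths(list(description = tmp_description, price = tmp_price,
             rooms = tmp_rooms, area = tmp_area))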

1 Answer


If you review the question I referenced in my comment, you will see the problem you are facing: not every listing node contains all of the information you are looking for, which is what produces the unequal-length errors.

The best way to handle these situations is to collect all of the parent nodes into a list/vector and then extract the desired information from each parent with the html_node() (without the "s") function. html_node() always returns exactly one result per parent node, even if that result is NA; html_nodes() returns nothing for a parent with no match, which is what throws the column lengths out of alignment.
See comments for more information.
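A minimal illustration of the difference, using a made-up HTML fragment (not the real page structure):

library(rvest)

fragment <- minimal_html("
  <div class='card'><span class='price'>€ 100.000</span></div>
  <div class='card'></div>
")

cards <- fragment %>% html_nodes(".card")

# html_node(): exactly one result per parent, NA where the child is missing
cards %>% html_node(".price") %>% html_text()
#> [1] "€ 100.000" NA

# html_nodes(): silently drops the parent with no match, so lengths stop lining up
cards %>% html_nodes(".price") %>% html_text()
#> [1] "€ 100.000"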

library(rvest)
library(dplyr)

df_apartments <- list()
for (i in 1:7) { 
   #read page
   page <- read_html(paste0("https://www.immobiliare.it/vendita-case/milano/?pag=", i))
   
   #read the parent nodes
   apartments <- page  %>% html_nodes(xpath= "//div[@class='nd-mediaObject__content in-card__content in-realEstateListCard__content']")
   
# parse information from each of the parent nodes
  price <- apartments %>% html_node(xpath= ".//li[@class='nd-list__item in-feat__item in-feat__item--main in-realEstateListCard__features--main']") %>% html_text(trim = TRUE)
  rooms <- apartments %>% html_node(xpath= ".//li[@aria-label='locali']") %>% html_text(trim = TRUE)
  area <- apartments %>% html_node(xpath= ".//li[@aria-label='superficie']") %>% html_text(trim = TRUE)
  description <-  apartments %>% html_node( xpath= ".//a[@class='in-card__title']") %>% html_text()
      
# put the data together into a data frame add to list                  
   df_apartments[[i]] <- data.frame(price, rooms, area, description)
}
#combine all data frames into 1
answer <- bind_rows(df_apartments)


head(answer, 7)
         price rooms  area                                                            description
1    € 900.000     3 125m²                     Trilocale via Orti 2, Quadronno - Crocetta, Milano
2    € 275.000     2  50m²                Bilocale via Gian Francesco Pizzi 34, Ripamonti, Milano
3    € 275.000     2  55m²                 Bilocale viale dei Mille 14, Plebisciti - Susa, Milano
4    € 799.000     4 135m²                 Quadrilocale via Beato Angelico 3, Città Studi, Milano
5    € 210.000     2  65m²                    Bilocale piazza Monte Falterona 5, San Siro, Milano
6    € 240.000     2  50m²                Bilocale via dell'Assunta 5, Vigentino - Fatima, Milano
7    € 395.000     3  90m² Trilocale via Cuore Immacolato di Maria 12, Vigentino - Fatima, Milano

Update
Because we are using the xpath option, we need to add a "." before the "//" to tell the XPath parser to start at the current node and not at the root of the document.
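A quick way to see the effect, again with a made-up fragment:

library(rvest)

fragment <- minimal_html("
  <div class='card'><span class='price'>€ 100.000</span></div>
  <div class='card'><span class='price'>€ 200.000</span></div>
")

cards <- fragment %>% html_nodes(".card")
first_card <- cards[[1]]

# Without the leading ".", the XPath starts at the document root and matches everything
first_card %>% html_nodes(xpath = "//span[@class='price']") %>% html_text()
#> [1] "€ 100.000" "€ 200.000"

# With ".//", the search is restricted to the current node
first_card %>% html_nodes(xpath = ".//span[@class='price']") %>% html_text()
#> [1] "€ 100.000"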

Dave2e
  • Thank you for the answer. I will try again, but this is a solution I had already tried before writing here. Maybe this time will be the one, but usually it gives me the first node and then the same one repeated 25 times. – Gio 255 Jul 05 '22 at 12:04
  • Thank you so much. For one page it works flawlessly, but when I put it in a loop it shows me only the first page of results, even if I use rbind. – Gio 255 Jul 05 '22 at 23:20
  • `urls <- sprintf("https://www.immobiliare.it/vendita-case/milano/?pag=%d", 2:7) for (i in urls) { page <- read_html(i)` and then I put the rest of the code inside the for loop. – Gio 255 Jul 06 '22 at 12:34
  • I tried it and it works better, but at page 25 it gives me the same error: the columns have different lengths, so rbind can't proceed. – Gio 255 Jul 07 '22 at 10:16
  • I don't know, maybe it is something with that last page. Maybe ask a new question about that page; in the meantime you can work with the 24 pages of data you have. – Dave2e Jul 07 '22 at 10:58