I am trying to scrape a database containing information about previously sold houses in an area of Denmark. I want to retrieve information not only from page 1, but also from pages 2, 3, 4, etc.

I am new to R, but by following a tutorial I ended up with this:

library(purrr)
library(rvest)

urlbase <- "https://www.boliga.dk/solgt/alle_boliger-4000ipostnr=4000&so=1&p=%d"
map_df(1:5,function(i){
    cat(".")
    page <- read_html(sprintf(urlbase,i))

    data.frame(Address = html_text(html_nodes(page,".d-md-table-cell a")))
               Price = html_text(html_nodes(page,".text-md-left+ .d-md-table-cell .text-right"))
               Rooms = html_text(html_nodes(page,".d-md-table-cell:nth-child(5) .paddingR"))
               m2 = html_text(html_nodes(page,".qtipped+ .d-md-table-cell .paddingR"))
               stringsAsFactors = FALSE

}) -> BOLIGA.ROSKILDE

View(BOLIGA.ROSKILDE)

Which gives me the message:

Error in bind_rows_(x, .id) : Argument 1 must have names

Any help would be welcome

kath
Thomas
  • For me, `https://www.boliga.dk/solgt/alle_boliger-4000ipostnr=4000&so=1&p=%d` does not work; it gives `Bad Request - Invalid URL`. – s__ Aug 16 '18 at 13:29
  • https://www.boliga.dk/solgt/alle_boliger-4000ipostnr=4000&so=1&p=1 /// Oops, sorry. %d should be 1 for page 1, 2 for page 2, etc. – Thomas Aug 16 '18 at 14:10
  • Have you had success with similar code but for only one page? When I have problems with setups like this, I start with running code for one iteration of the `map`, such as scraping a single page. Then I try `map` instead of `map_dfr`, as it's less strict about the structure. An error in `bind_rows` suggests to me that the problem is in binding all the `map` outputs into one data frame – camille Aug 16 '18 at 14:16
  • You also are missing commas in between the code to make columns in your data frame, and then have `stringsAsFactors = FALSE` hanging on its own, so you're just returning `FALSE`, not a data frame – camille Aug 16 '18 at 14:22
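
The point made in the last comment can be reproduced without any scraping. Below is a minimal sketch in base R; `broken()` and `fixed()` are hypothetical stand-ins for the anonymous function in the question, and the addresses and prices are made up. It only illustrates why the function body returns `FALSE` rather than a data frame, which is presumably what `bind_rows` then complains about.

```r
# Hypothetical stand-in for the question's function: the data.frame() call
# is closed after the first column, so the following lines are separate
# assignments and the function returns the value of the last one (FALSE).
broken <- function(i) {
  data.frame(Address = paste("Street", i))
  Price = i * 1000
  stringsAsFactors = FALSE
}

# Same columns, but separated by commas inside a single data.frame() call.
fixed <- function(i) {
  data.frame(Address = paste("Street", i),
             Price = i * 1000,
             stringsAsFactors = FALSE)
}

broken_out <- broken(1)                          # a bare logical, not a data frame
fixed_out <- do.call(rbind, lapply(1:3, fixed))  # base-R analogue of map_df()

print(broken_out)
print(fixed_out)
```

Running one iteration like this before wrapping it in `map_df` makes it easy to see what each call actually returns.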

1 Answer


Try this one:

library(rvest)
library(tidyverse)

url <- "https://www.boliga.dk/solgt/alle_boliger-4000ipostnr=4000?ipostnr=4000ipostnr&so=1&p=1"

# Find the number of pages in the table: read the total result count
# from the first page and divide it by 40
total <- read_html(url) %>%
  html_nodes(".d-print-none") %>%
  html_nodes("b") %>%
  html_text() %>%
  gsub("[^\\d]+", "", ., perl = TRUE) %>%  # keep digits only
  as.numeric()
pgs <- ceiling(total / 40)

# Scrape the results table from a single page
scrap <- function(pg) {
  url <- paste0("https://www.boliga.dk/solgt/alle_boliger-4000ipostnr=4000?ipostnr=4000ipostnr&so=1&p=", pg)
  read_html(url) %>%
    html_node(".searchResultTable") %>%
    html_table() %>%
    .[, c(1, 2, 5, 4)] %>%  # select the address, price, rooms and m2 columns
    magrittr::set_colnames(c("Address", "Price", "Rooms", "m2")) %>%
    mutate(m2 = as.numeric(m2))
}

# Use purrr to scrape every page and bind the rows into one data frame
df <- seq(1, pgs) %>%
  map_df(scrap)
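
To make the page-count step easier to follow, here is a small offline sketch of the same arithmetic in base R. The string in `count_text` is an assumed example (the wording on the real page may differ), and `rows_per_page` simply names the divisor 40 used above, on the assumption that it is the number of rows shown per page.

```r
# Assumed example of the text holding the total result count;
# the actual wording on the site may differ.
count_text <- "1.234 solgte boliger"
rows_per_page <- 40  # divisor used in the answer; assumed rows per page

# Strip everything that is not a digit, then convert to a number.
total <- as.numeric(gsub("[^\\d]+", "", count_text, perl = TRUE))

# Round up so that a partially filled last page is still counted.
pgs <- ceiling(total / rows_per_page)

# One URL per page, built the same way as inside scrap().
page_urls <- paste0(
  "https://www.boliga.dk/solgt/alle_boliger-4000ipostnr=4000?ipostnr=4000ipostnr&so=1&p=",
  seq_len(pgs)
)

print(total)
print(pgs)
print(tail(page_urls, 1))
```

`seq_len(pgs)` is used here instead of `seq(1, pgs)` because it yields an empty sequence rather than `c(1, 0)` if the count ever comes back as zero.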
jyjek