
I was trying to scrape a website to extract data from many pages using rvest and purrr, but every time I run the code, the error "Error in open.connection(x, "rb") : HTTP error 404." appears.

url <- "http://books.toscrape.com/catalogue/page-%d"

map_df(1:10, function(i){ 
  
  page <- read_html(sprintf(url, i))
   cat(".")
  
  booksdf <- data.frame(safely( title <- html_nodes(page, "h3, #title") %>% html_text(),
                       price <- html_nodes(page, ".price_color") %>% html_text() %>% gsub("£", "", .),
                       rating <- html_nodes(page, ".star-rating") %>% html_attrs() %>% str_remove("star-rating") %>%str_replace_all(c("One" = "1", "Two" = "2", "Three" = "3", "Four" = "4", "Five" = "5")) %>%  as.numeric()
                       )
                      
  )
  
  
} 
)
Error in open.connection(x, "rb") : HTTP error 404.
Dharman
Teby
  • And how is this an R coding error? – Rui Barradas May 25 '19 at 19:49
  • A 404 error indicates that the page you requested was not found on the server. I suggest you check that you are building the correct URL. If I go to `http://books.toscrape.com/catalogue/page-2` in my browser I get a 404 error just like R. Did you forget to add the ".html" part to the URL? – MrFlick May 25 '19 at 19:57
  • Thank you @MrFlick, that worked. But when I run the code there's another error: "Error in type(pattern) : argument "pattern" is missing, with no default". I couldn't find any missing argument (each line runs perfectly outside map_df). – Teby May 25 '19 at 20:31
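
The URL fix suggested in the comments can be sketched while keeping the question's `sprintf` approach. This is a minimal illustration that only builds the page URLs and makes no requests; the name `url_template` is a hypothetical one introduced here.

```r
# Template from the question with the missing ".html" suffix added.
# `url_template` is a hypothetical name used only for this sketch.
url_template <- "http://books.toscrape.com/catalogue/page-%d.html"

# sprintf() is vectorised, so a single call builds all ten page URLs.
pages <- sprintf(url_template, 1:10)

# Inspect the first few URLs before passing them to read_html().
print(head(pages, 3))
```

Printing the URLs before scraping makes it easy to paste one into a browser and confirm that it does not return a 404.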

1 Answer


We can create the URLs to scrape first and then use map_df to bind the per-page data frames together.

library(tidyverse)
library(rvest)

url <- "http://books.toscrape.com/catalogue/page-"
pages <- paste0(url, 1:10, ".html")

map_df(pages, function(i){ 
     # Parse one listing page, then build a data frame with one row per book
     page <- read_html(i)
     data.frame(title = html_nodes(page, "h3, #title") %>% html_text(),
                # Strip the currency symbol from the price text
                price = html_nodes(page, ".price_color") %>% html_text() %>% 
                        gsub("£", "", .),
                # The class attribute looks like "star-rating Three"
                rating = html_nodes(page, ".star-rating") %>% html_attr("class") %>% 
                         str_remove("star-rating") %>%
                         str_replace_all(c("One" = "1", "Two" = "2", 
                         "Three" = "3", "Four" = "4", "Five" = "5")) %>%  
                          as.numeric())
})


#                                            title price rating
#1                               A Light in the ... 51.77      3
#2                               Tipping the Velvet 53.74      1
#3                                       Soumission 50.10      1
#4                                    Sharp Objects 47.82      4
#5                     Sapiens: A Brief History ... 54.23      5
#6                                  The Requiem Red 22.65      1
#7                     The Dirty Little Secrets ... 33.34      4
#8                          The Coming Woman: A ... 17.93      3
#.....
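
As a possible alternative to the chained string replacement for the rating, a named lookup vector can map the rating word to a number. This is only a sketch in base R: the `classes` values are assumed sample inputs in the shape of the class attribute on the `.star-rating` nodes, not scraped output, and `rating_lookup` is a hypothetical name.

```r
# Assumed sample class attributes, in the shape "star-rating <Word>".
classes <- c("star-rating Three", "star-rating One", "star-rating Five")

# Hypothetical named lookup vector: rating word -> numeric rating.
rating_lookup <- c(One = 1, Two = 2, Three = 3, Four = 4, Five = 5)

# Drop the "star-rating " prefix, then index the lookup by name.
rating_word <- sub("^star-rating\\s+", "", classes)
rating <- unname(rating_lookup[rating_word])

print(rating)
# [1] 3 1 5
```

An unexpected class value would produce `NA` here, which makes a changed page layout easier to spot.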
Ronak Shah