Received Error in open.connection(x, "rb") : HTTP error 404 after running a for loop in R

Question

While trying to scrape information from several links, I got the error: Error in open.connection(x, "rb") : HTTP error 404.

I suspect it has something to do with the first part of my for loop, so I tried converting the numbers from character to numeric, but that did not fix the problem. I also tried the advice here; however, it caused more problems.

Think you can spot where I went wrong?

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .) 

# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)  
numbers = as.numeric(numbers)

## create empty data frame -------------------------
master1.tree = data.frame()

## Create for loop ---------------------------------
for (i in length(numbers)){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
  
ageDivision <- url %>% html_nodes('.category-title__age-division') %>% html_text()

gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text()  

matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, data.frame(matches))
}

I also ran this, but it did not store the scraped data in a data frame. Instead, it printed the results to the screen:

map_df(get_links, function(i){
  url <- read_html(i)
  
matches <- data.frame(ageDivision <- url %>% 
  html_nodes('.category-title__age-division') %>% html_text(),
gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text() ) 

master1.tree <- rbind(master1.tree, matches)
})

Try `for (i in numbers)` to loop over your vector `numbers`. — stefan, Sep 11 '22 at 20:44
It did not do anything. It looked like it was running something, but once it stopped nothing happened. There was nothing in my environment either. — bandcar, Sep 11 '22 at 20:49
nvm! I just made a few small adjustments and it worked with your suggestion! — bandcar, Sep 11 '22 at 20:56

score 1 · Accepted Answer · answered Sep 11 '22 at 21:07

Here is an alternative to your code. First, there is no need to extract the numbers: you can loop directly over the vector get_links. Second, I use purrr::map_df for the looping, which is more concise than a for loop and row-binds the results into a single data frame. To do this, I wrap the scraping of a single page in a custom function. Finally, I use trim = TRUE in html_text to remove leading and trailing whitespace. The example scrapes only the first five links:

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')

get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .)

scrape_page <- function(url) {
  html <- read_html(url)
  data.frame(
    division = html %>% html_nodes('.category-title__age-division') %>% html_text(trim = TRUE),
    gender = html %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text(trim = TRUE)
  )
}

master1.tree <- purrr::map_df(get_links[1:5], scrape_page)

master1.tree
#>   division gender
#> 1 Master 1   Male
#> 2 Master 1   Male
#> 3 Master 1   Male
#> 4 Master 1   Male
#> 5 Master 1   Male
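
As a side note on why the map_df attempt in the question printed its results instead of storing them, here is a minimal offline sketch. It assumes a hypothetical stub, scrape_stub, and made-up example.com links in place of the real pages, so nothing here touches the network. The point it illustrates is that rbind() inside the function only assigns to a local copy, while the combined data frame is the value that map_df() returns and has to be assigned:

```r
library(purrr)

# Hypothetical stand-in for a real scraper: returns one row, no network access
scrape_stub <- function(url) {
  data.frame(url = url, division = "Master 1", gender = "Male")
}

# Hypothetical example links (not the real tournament pages)
links <- paste0("https://example.com/categories/", 1:3)

master_unchanged <- data.frame()

# Pattern from the question: the assignment inside the function is local,
# so the data frame defined outside is never modified
ignored <- map_df(links, function(u) {
  master_unchanged <- rbind(master_unchanged, scrape_stub(u))
})

# Pattern from the answer: keep the value that map_df() returns
collected <- map_df(links, scrape_stub)

nrow(master_unchanged)  # still 0
nrow(collected)         # 3
```

When the return value of map_df() is not assigned at the top level, R auto-prints it, which matches the behaviour described in the question.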

This looks great, thank you! I have a follow-up question. Any idea why I need to run the code (yours and mine) twice before R puts anything in the environment? I'll run the code, it looks like it's running something, and when it stops, the environment is empty. So, I run it a second time and everything is there. — bandcar, Sep 11 '22 at 21:42
Hm. No clue what the issue could be. When I run my code or yours, it works fine on my machine. Sometimes simply restarting the R session helps. — stefan, Sep 11 '22 at 21:46

score 0 · Answer 2 · answered Sep 11 '22 at 20:57

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')

get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .) 

# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)  
numbers = as.numeric(numbers)

## create empty data frame -------------------------
master1.tree = data.frame()

## Create for loop ---------------------------------
# Loop over the category IDs themselves, not over length(numbers)
for (i in numbers){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))

  ageDivision <- url %>% 
    html_nodes('.category-title__age-division') %>% 
    html_text()

  gender <- url %>% 
    html_nodes('.category-title__age-division+ .category-title__label') %>% 
    html_text()

  # Append this page's rows to the running result
  matches = data.frame('division' = ageDivision,'gender' = gender)
  master1.tree <- rbind(master1.tree, matches)
}
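
The only change from the question's loop is the loop header. A small offline sketch of the difference, assuming three made-up category IDs in place of the scraped numbers vector: length(numbers) is a single value, so the original loop runs once and builds a URL ending in the count of links, which is the likely source of the 404.

```r
base_url <- 'https://www.bjjcompsystem.com/tournaments/1869/categories/'

# Hypothetical IDs standing in for the scraped `numbers` vector
numbers <- c(2139001, 2139002, 2139003)

# Original header: iterates over the single value length(numbers)
urls_original <- c()
for (i in length(numbers)) {
  urls_original <- c(urls_original, paste0(base_url, i))
}

# Corrected header: iterates over the IDs themselves
urls_fixed <- c()
for (i in numbers) {
  urls_fixed <- c(urls_fixed, paste0(base_url, i))
}

print(urls_original)  # one URL, ending in /categories/3
print(urls_fixed)     # three URLs, one per ID
```

If positions are needed instead of values, seq_along(numbers) yields 1, 2, 3 and numbers[i] then gives the ID for each step.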
