Here's the context of the problem I'm facing:
I have 202 URLs stored in a vector and I'm trying to scrape information from them using a for loop.
The URLs cover basically every product listed on this website: https://lista.mercadolivre.com.br/_CustId_38356530
I obtained them using this code:
library(rvest)
library(tidyverse)

get_products <- function(n_page) {
  cat("Scraping index", n_page, "\n")
  # Build the listing URL for this page offset and download it
  page <- str_c(
    "https://lista.mercadolivre.com.br/_Desde_",
    n_page,
    "_CustId_38356530_NoIndex_True"
  ) %>%
    read_html()
  # Keep the unique product links found on the listing page
  tibble(url = page %>%
           html_elements('a.ui-search-link') %>%
           html_attr('href') %>%
           str_subset('tracking_id') %>%
           unique())
}

# Walk through the listing pages (offsets 1, 49, 97, ...) and bind the results
products_url <- map_dfr(seq(1, 49 * 4, by = 48), get_products)
Problem is: I keep getting an error:
error in open.connection(x, "rb") : HTTP error 404
I have read a few articles and Q&A threads discussing this problem, but I can't seem to find a solution that works for my case.
For example, someone suggests that the error happens when the page doesn't exist:
rvest Error in open.connection(x, "rb") : HTTP error 404
However, that's not the case: when I open the URLs that triggered the error in the browser, they load just fine.
Besides, if that were the case, I should get the error on the same positions in the vector every time I run the code, but the failures seem to happen at random.
For example:
The first time I ran the code, I got the error on vector[6].
The second time I ran the same snippet, scraping vector[6] worked just fine.
It was also suggested that I should use try() or tryCatch() to keep the error from stopping the for loop.
And for that purpose, try() worked.
However, it would be preferable to avoid the error altogether, because otherwise I have to run the same snippet of code several times to scrape every value I need.
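To make that concrete, this is roughly the retry wrapper I'd end up writing as a workaround (just a sketch; read_html_retry is a name I made up, not something in my current code):

library(rvest)

# Hypothetical helper: retry read_html() a few times before giving up
read_html_retry <- function(url, max_tries = 3, wait = 2) {
  for (attempt in seq_len(max_tries)) {
    result <- try(read_html(url), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)
    }
    Sys.sleep(wait)  # pause before the next attempt
  }
  stop("Failed to read ", url, " after ", max_tries, " attempts")
}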
Can anyone help me, please?
Why is this happening, and what can I do to prevent it?
Here's the code I'm running, if it helps:
for (i in seq_along(standard_ad)) {
  # try() keeps an occasional failure from stopping the whole loop
  try(
    collectedtitles <- collect(standard_ad[i], '.ui-pdp-title')
  )
  standard_titles <- append(standard_titles, collectedtitles)
}
collect() is a function I created:
collect <- function(webpage, section) {
  page <- read_html(webpage)
  value <- html_node(page, section)
  html_text(value)  # return the extracted text
}
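For reference, this is roughly how a tryCatch() version of collect() would look, in case that matters for the answer (just a sketch; collect_safe is a name I made up):

collect_safe <- function(webpage, section) {
  tryCatch(
    {
      page <- read_html(webpage)
      html_text(html_node(page, section))
    },
    error = function(e) {
      message("Failed on ", webpage, ": ", conditionMessage(e))
      NA_character_  # return NA so the loop can keep going
    }
  )
}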