
Here's the context of the problem I'm facing:

I have 202 URLs stored in a vector and I'm trying to scrape information from them using a for loop.

The URLs are basically every product that shows up within this website: https://lista.mercadolivre.com.br/_CustId_38356530

I obtained them using this code:

library(tidyverse)  # str_c(), str_subset(), tibble(), map_dfr()
library(rvest)      # read_html(), html_elements(), html_attr()

get_products <- function(n_page) {
  cat("Scraping index", n_page, "\n")
  page <- str_c(
    "https://lista.mercadolivre.com.br/_Desde_",
    n_page,
    "_CustId_38356530_NoIndex_True"
  ) %>%
    read_html()
  
  tibble(url = page %>% 
      html_elements('a.ui-search-link') %>% 
      html_attr('href') %>% 
      str_subset('tracking_id') %>% 
      unique()
  )}

products_url <- map_dfr(seq(1, 49 * 4, by = 48), get_products)

Problem is: I keep getting an error:

Error in open.connection(x, "rb") : HTTP error 404

I have read a few articles and Q&A threads discussing this problem, but I can't seem to find a solution that works for my case.

For example, someone suggests that the error happens when the page doesn't exist:

rvest Error in open.connection(x, "rb") : HTTP error 404


However, that's not the case: when I visit the URLs that triggered the error, they load just fine.

Plus, if that were the case, I should get the error for the same elements of the vector every time I run the code. Instead, the errors seem to happen randomly.

For example:

  • The first time I ran the code, I got the error on vector[6].

  • The second time I ran the same snippet, scraping vector[6] worked just fine.

It was also suggested that I use try() or tryCatch() to keep the error from stopping the for loop.

And for that purpose, try() worked.

However, it would be preferable to avoid the error altogether; otherwise I have to run the same snippet of code several times in order to scrape every value I need.

Can anyone help me, please?

Why is it happening and what can I do to prevent it?

Here's the code I'm running, if it helps:

for (i in seq_along(standard_ad)) {
  # try() keeps an HTTP error from stopping the loop
  try(
    collectedtitles <- collect(standard_ad[i], '.ui-pdp-title')
  )
  standard_titles <- append(standard_titles, collectedtitles)
}

collect() being a function I created:

collect <- function(webpage, section) {
  page <- read_html(webpage)
  value <- html_node(page, section)
  html_text(value)
}
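
A retry wrapper around collect(), along the lines of the try()/tryCatch() suggestion, might look something like the sketch below (the number of attempts and the Sys.sleep() delay are placeholder guesses, assuming the 404s are transient):

collect_with_retry <- function(webpage, section, attempts = 3, delay = 5) {
  # Sketch only: attempts and delay are arbitrary placeholders
  for (i in seq_len(attempts)) {
    result <- tryCatch(
      collect(webpage, section),
      error = function(e) NULL   # swallow the HTTP error and retry
    )
    if (!is.null(result)) return(result)
    Sys.sleep(delay)             # pause before the next attempt
  }
  NA_character_                  # give up after the last attempt
}

standard_titles <- purrr::map_chr(standard_ad, collect_with_retry,
                                  section = '.ui-pdp-title')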
  • To better assist you, it would be helpful if you could provide the specific website that you are attempting to scrape. This would allow us to test the code and troubleshoot any issues more effectively – Chamkrai Dec 31 '22 at 17:33
  • The web may be throttling the requests. Try using `Sys.sleep(2)` to add a bit of delay between requests, this may help. – Dave2e Dec 31 '22 at 17:34
  • @Tom I'm trying to scrape the products within this website: 'https://lista.mercadolivre.com.br/_CustId_38356530'! It's in Portuguese, but I hope that won't be an issue – Marina Bonatti Dec 31 '22 at 17:57
  • When I run `rvest::read_html('https://lista.mercadolivre.com.br/_CustId_38356530')` an html document is produced, so I'm unable to reproduce a 404 error at that webpage. Voting to close as 'not reproducible'. – IRTFM Dec 31 '22 at 18:22
  • I faced this problem. As @Dave2e suggested, `Sys.sleep(x)` solved it in all cases. Try 2 or even more, if you still get inconsistent results. – PavoDive Dec 31 '22 at 18:32
  • @IRTFM I meant the product pages within 'https://lista.mercadolivre.com.br/_CustId_38356530'. A total of 222 URLs as of today – Marina Bonatti Dec 31 '22 at 18:56

1 Answer


From the link you provided, I scraped the five available pages for that query without getting any error. Could you explain in more detail how you got your error?

library(tidyverse)  # str_c(), tibble(), map_dfr(), str_squish()
library(rvest)      # read_html(), html_elements(), html_text2()

get_products <- function(page) {
  cat("Scraping index", page, "\n")
  page <- str_c(
    "https://lista.mercadolivre.com.br/",
    "_Desde_",
    page,
    "_CustId_38356530_NoIndex_True"
  ) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".shops__item-title") %>%
      html_text2() %>%
      str_squish(),
    price = page %>%
      html_elements(".ui-search-layout__item") %>%
      html_element(".price-tag-text-sr-only") %>%
      html_text2() %>%
      str_replace_all(" reais con ", ".") %>%
      str_remove_all(" centavos| reais") %>%
      as.numeric(), 
    product_link = page %>% 
      html_elements(".ui-search-result__content.ui-search-link") %>% 
      html_attr("href")
  )}

df <- map_dfr(seq(1, 49 * 4, by = 48), get_products)

Here is how to scrape the amount sold from the individual product pages with the polite package. polite was designed to be friendly to the sites being scraped, so it is slower than rvest but more reliable in certain scenarios. I have scraped 20 pages successfully without any issues. Run the previous code and then this one:

library(polite) 

sold_amount <- function(product_link) {
  cat("Scraping", product_link, "\n")
  product_link %>% 
    bow(force = TRUE) %>% 
    scrape() %>%  
    html_element(".ui-pdp-subtitle") %>%  
    html_text2() %>%  
    str_remove_all("[^0-9]") %>% 
    as.numeric()
}

df <- df %>%  
  mutate(sold = map_dbl(product_link, sold_amount))

# A tibble: 20 × 4
   title                                                price product_link  sold
   <chr>                                                <dbl> <chr>        <dbl>
 1 Fone Superlux Hd661 Para Retorno Baterista Teclado …  433. https://pro…    NA
 2 Kit Caixas Donner Saga 12 Ativa 250w + Passiva 130w… 2290  https://pro…     8
 3 Fone Para Gamer Jogar Ps4 Pc Xbox One P2 Celular He…  292  https://pro…    NA
 4 Pandeiro Contemporânea 10 Polegadas Leve Light Cour…  233. https://pro…   131
 5 Caixa De Som Ll Audio Up ! 8 Com Bluetooth Fm Usb E…  658  https://pro…     4
 6 Violão De Nylon Náilon Giannini N-14 Natural Série …  464  https://pro…     2
 7 Amplificador De Som Receiver Sa20 100w Usb Card Blu…  948  https://pro…     1
 8 Amplificador Randall Big Dog 15w Guitarra Mostruári…  434  https://pro…    NA
 9 Caixa De Som Leacs 10'' Fit 160 Passiva Retorno Mon…  890  https://pro…    NA
10 Caixa De Som Amplificada Ll Lx40 Microfone Guitarra…  444  https://pro…    33
11 Violão De Nylon Náilon Giannini N-14 Natural Série …  440  https://pro…     2
12 Microfone Sem Fio Jwl Headset Duplo Uhf U-585 Hh + …  700. https://pro…    42
13 Direct Box Single-channel Bypass Waldman - Passivo …  159. https://pro…     1
14 Caixa Acústica Donner Saga 12 Duas Vias Passiva 130…  840. https://pro…    NA
15 Mesa De Som Mixer Nca Nanomix Ll Na402r 4 Canais Bi…  359  https://pro…    14
16 Guitarra elétrica Tagima TW Series TW-61 de choupo … 1529  https://www…   154
17 5 Microfone Superlux De Mão Dinâmico C1 + Cachimbo … 1598  https://pro…     3
18 Estante Máquina Ferragem Suporte Chimbal Turbo Powe…  378. https://pro…     5
19 Microfone Skypix M58 Para Igreja, Banda, Eventos...…   98  https://pro…    NA
20 Teclado Yamaha Psr F52 Com Fonte Bivolt - F 52        977  https://www…   666
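
If a product page still errors out for some reason, the map_dbl() call above stops partway through. A small sketch using purrr::possibly() (the otherwise = NA_real_ fallback is my choice here) keeps the pipeline going and records NA for that page instead:

# Sketch: fall back to NA instead of raising an error when a page fails,
# so one bad request does not abort the whole mutate()/map_dbl() run
safe_sold_amount <- possibly(sold_amount, otherwise = NA_real_)

df <- df %>%
  mutate(sold = map_dbl(product_link, safe_sold_amount))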
Chamkrai
  • Wow, @Tom, your approach is way faster and more effective! Thank you so much! The problem is, there's information that is only visible if you actually go to the product webpage. For example, here: 'https://produto.mercadolivre.com.br/MLB-1633642316-pandeiro-contempornea-34cl-12-polegadas-linha-leve-_JM#position=46&search_layout=grid&type=item&tracking_id=85f531f7-97ae-4bd0-9c9d-ce3f348473a9', I needed to scrape the '6 vendidos' piece of information in the .ui-pdp-subtitle. So my approach was to go to each product page – Marina Bonatti Dec 31 '22 at 18:54
  • @MarinaBonatti I believe you are using the wrong tag, hence the error. Try with `ui-pdp-header__title-container`. But my code grabs the same titles, no? – Chamkrai Dec 31 '22 at 19:04
  • Your code works perfectly for the titles and prices, @Tom! However, you use the search pages, right? In this case, I wanted to scrape a piece of information that can only be seen on the product page: the quantity of sold items and the item condition. In the URL I sent you before, it would be the 'Novo | 6 vendidos'. For those cases I still get the 404 error, because inevitably I have to read_html each product page, and I'm assuming the site might be throttling the requests as @Dave2e said :( Sys.sleep seems to be helping, but still hasn't fixed it, even using 50 seconds as a parameter – Marina Bonatti Jan 02 '23 at 14:20
  • @MarinaBonatti Check out my edit. This should solve the problem. Products with none sold will be listed as NA. – Chamkrai Jan 02 '23 at 14:57
  • Wow, @Tom! I made a few changes to the code so I could get exactly the results I wanted, but that polite package is a life saver! I managed to scrape 402 pages successfully (no issues whatsoever). That alternative is not only more effective but it also ended up being 'faster' than rvest (considering the Sys.sleep I had to use to try to prevent the error from happening)! You're incredible, thank you so much for your help and patience! – Marina Bonatti Jan 03 '23 at 00:25