I need help with timeouts when requesting a long list of URLs.

I have a list of about 9000 links. I loop over it and process each link in turn.

The links look like this:

link1 link2 link3 link4 ..... link9000

The problem is that requests time out unpredictably: on one run the 2nd link fails, on another run the 2nd link works and the 400th, or some other random link, fails. Is there a way to retry a failed link repeatedly until it succeeds? I have already tried a longer connection timeout:

status_c <- httr::GET(Links, config = httr::config(connecttimeout = 150))

but I still get timeouts. Any help or suggestions would be appreciated. final_links_bind holds the full list of links. Some sample links:

https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789

for (i in 1:nrow(final_links_bind)) {
  Links <- final_links_bind[i, ]
  BP_ID <- final_bp_bind[i, ]
  # print(Links)
  status_c <- GET(Links, timeout(120))
  status <- status_code(status_c)
  if (status == 200) {
    # Parse the page and collect the text of every table row
    url_parse <- read_html(Links)
    col_name <- url_parse %>%
      html_nodes("tr") %>%
      html_text()
    # Strip tabs and line breaks
    col_name <- stringr::str_remove_all(col_name, "[\t\n\r]")
    # Keep only the row that mentions "využití"
    pattern_col_no <- grep("využití", col_name)
    col_name <- as.data.frame(col_name)
    method_selected <- col_name[pattern_col_no, ]
    WRITE_CSV_DATA <- rbind(WRITE_CSV_DATA, data.frame(BP_ID = c(BP_ID), method_selected = c(method_selected), Links = c(Links)))
    # METHOD_OF_USE <- rbind(method_selected, METHOD_OF_USE)
    print(WRITE_CSV_DATA)
  } else {
    print("LINK NOT WORKING")
    no_Links <- sorted_link[i, ]
    not_working_link <- rbind(not_working_link, no_Links)
  }
}

oguz ismail
walle_eva
  • Can you provide some of the links and more of your code so we can test? – Chamkrai Jan 19 '23 at 15:24
  • Edited, with code. – walle_eva Jan 19 '23 at 15:31
  • What kind of info would you like to extract from the sites? – Chamkrai Jan 19 '23 at 15:40
  • This [question](https://stackoverflow.com/questions/52371296/create-function-to-avoid-url-error-in-r-for-loop) is relevant. Also, `rbind`ing inside your loop is inefficient. Better: write a function that processes a single element of `final_links_bind`, call it using `lapply` and *then* bind the results together... – Limey Jan 19 '23 at 15:40
  • @Tom One table, which looks like this: Způsob využití: rodinný dům – walle_eva Jan 19 '23 at 15:48
  • I was rbinding inside the loop because of the timeout errors: after a timeout I will at least have some data to work with. – walle_eva Jan 19 '23 at 15:50
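
To illustrate the suggestion in the comments above (one function per link, `lapply`, then a single bind), here is a minimal sketch. `process_one()` is a hypothetical stand-in that simulates a timeout for one link instead of downloading anything, and the column value it returns is a placeholder, not real scraped data. Wrapping each call in `tryCatch()` means one failure does not abort the run, so the partial results are kept.

```r
# Hypothetical stand-in for the per-link work. A real version would
# download and parse the page; this one only simulates a failure.
process_one <- function(link) {
  if (grepl("9999999", link, fixed = TRUE)) stop("simulated timeout")
  data.frame(Links = link, method_selected = "parsed value")
}

links <- c(
  "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
  "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
  "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
)

# Process every link; an error yields NULL instead of stopping the loop.
results <- lapply(links, function(link) {
  tryCatch(process_one(link), error = function(e) NULL)
})

# Links whose processing failed, kept for a later retry.
failed <- links[vapply(results, is.null, logical(1))]

# Bind the successful results once, after the loop.
combined <- do.call(rbind, Filter(Negate(is.null), results))
```

The `failed` vector can then be fed through the same `lapply()` call again, which is one simple way to retry only the links that timed out.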

1 Answer

It is not clear what final output you want, but here is how to scrape the links and skip those that are not working:

library(rvest)
library(httr2)
library(tidyverse)

Given this data frame of links (note that the third one does not work):

df <- tibble(
  links = c(
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
  )
)

# A tibble: 4 × 1
  links                                                
  <chr>                                                
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789

Create a function to scrape the table, specifically the third row:

get_info <- function(link) {
  cat("Scraping", link, "\n")
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(2) %>%
    slice(3) %>%
    pull(2) 
}

Then mutate() a new column with the info. possibly() wraps get_info so that a link that fails returns NA (NA_character_) instead of raising an error and stopping the code.

df %>% 
  mutate(
    info = map_chr(links, possibly(get_info, otherwise = NA_character_))
  )

# A tibble: 4 × 2
  links                                                 info       
  <chr>                                                 <chr>      
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711 rodinný dům
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703 rodinný dům
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999 NA         
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789 rodinný dům
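
The code above skips a failing link but does not try it again. Since the question asks about retrying, here is a minimal sketch of one way to do that in base R. `with_retry()` is a hypothetical helper (not part of purrr or httr), and `flaky()` is a simulated function that fails twice before succeeding, standing in for a scraper that hits intermittent timeouts.

```r
# Hypothetical helper: wrap f so that an error triggers another attempt,
# up to `times` attempts in total, pausing `pause` seconds in between.
# Returns `otherwise` if every attempt fails.
with_retry <- function(f, times = 3, pause = 1, otherwise = NA_character_) {
  function(x) {
    for (attempt in seq_len(times)) {
      result <- tryCatch(f(x), error = function(e) e)
      if (!inherits(result, "error")) return(result)
      if (attempt < times) Sys.sleep(pause)
    }
    otherwise
  }
}

# Simulated flaky function: the first two calls fail, the third succeeds.
state <- new.env()
state$calls <- 0
flaky <- function(x) {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("simulated timeout")
  paste("ok:", x)
}

out <- with_retry(flaky, times = 5, pause = 0)("link1")

# A function that always fails falls back to `otherwise`.
always_fails <- with_retry(function(x) stop("down"), times = 2, pause = 0)
fallback <- always_fails("link2")
```

Under these assumptions, `with_retry(get_info)` could take the place of `possibly(get_info, otherwise = NA_character_)` inside `map_chr()`. If you stay with httr as in the question, `httr::RETRY("GET", url, httr::timeout(120), times = 5)` is a built-in alternative worth checking against its documentation.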
Chamkrai