
When I run the code below with a for loop or with lapply, I get the following error:

"Error in get_entrypoint (debug_port):
  Cannot connect R to Chrome. Please retry. "


library(rvest)
library(xml2) #pull html data
library(selectr) #for xpath element

url_stackoverflow_rmarkdown <- 
  'https://stackoverflow.com/questions/tagged/r-markdown?tab=votes&pagesize=50'

web_page <- read_html(url_stackoverflow_rmarkdown)

questions_per_page <- html_text(html_nodes(web_page, ".page-numbers.current"))[1]

link_questions <- html_attr(html_nodes(web_page, ".question-hyperlink")[1:questions_per_page], 
                            "href")

setwd("~/WebScraping_chrome_print_to_pdf") 

for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])

  pagedown::chrome_print(question_to_pdf) 
}

Is it possible to build a for loop or use lapply so that the code resumes from where it broke? That is, can it continue from the last value of i instead of stopping with an error?

Many thanks

Laura
  • Maybe `tryCatch(pagedown::chrome_print(etc), error = function(e) print(e))`. – Rui Barradas Dec 10 '19 at 18:21
  • @RuiBarradas many thanks! It worked well, though it skipped some elements of "question_to_pdf". Is there a way to go back and try again for the skipped elements? – Laura Dec 10 '19 at 18:45
  • Maybe return the index `i` instead of just printing the error `e`. Then `i` could be used to retry. – Rui Barradas Dec 10 '19 at 21:02
  • You can try to put information regarding problematic `i`s as part of your `tryCatch()` – DJV Dec 10 '19 at 21:13
  • @DJV yes! I tried to include some `if else()`, but I don't know how to work with the `tryCatch()` function. Any help? – Laura Dec 10 '19 at 21:15
  • @Laura Please see my answer – DJV Dec 10 '19 at 21:45
  • You might like `purrr::safely()` – moodymudskipper Dec 16 '19 at 21:23
  • I wanted to stop the code if something happens, I found `stop()` does exactly what I wanted. Also, If someone wants to stop the code if a condition happens and raise an Error, you can use `stop("Error: Text of your error")` – Fabian Pino May 10 '23 at 15:27

2 Answers


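I adapted @Rui Barradas's tryCatch() idea. You can try something like the code below. Each element of IsValues will hold either the value returned by chrome_print() (the output file path) or, if the conversion failed, the index i of the bad link.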

IsValues <- list()
for (i in seq_along(link_questions)) {  # seq_along() is safe even if the vector is empty
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])

  IsValues[[i]] <- tryCatch(
    {
      message(paste("Converting", i))

      pagedown::chrome_print(question_to_pdf)
    },
    error=function(cond) {
      message(paste("Cannot convert", i))
      # Choose a return value in case of error
      return(i)
    }) 
}

Then you can rbind the values and extract the bad indices, i.e. the entries that do not end in ".pdf":

do.call(rbind, IsValues)[!grepl("\\.pdf$", do.call(rbind, IsValues))]

[1] "3"  "5"  "19" "31"

You can read more about tryCatch() in this answer.
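
To go back and retry the skipped elements, as asked in the comments, one option is to loop over the bad indices again for a limited number of rounds. The following is only a minimal sketch: `retry_failed()`, its arguments and the stand-in `fake_print()` are hypothetical names introduced here for illustration, and `print_fun` is assumed to be a function like pagedown::chrome_print that either returns normally or throws an error.

```r
# Hypothetical helper: re-run a print function on the indices that failed.
# `print_fun` defaults to pagedown::chrome_print; the default is only
# evaluated when it is actually used, so a stand-in can be passed instead.
retry_failed <- function(urls, failed, print_fun = pagedown::chrome_print,
                         max_rounds = 3) {
  for (round in seq_len(max_rounds)) {
    if (length(failed) == 0) break
    still_failed <- integer(0)
    for (i in failed) {
      ok <- tryCatch({
        print_fun(urls[i])
        TRUE
      }, error = function(cond) {
        message("Round ", round, ": cannot convert ", i)
        FALSE
      })
      if (!ok) still_failed <- c(still_failed, i)
    }
    failed <- still_failed
  }
  failed  # indices that never succeeded within max_rounds
}

# Demo with a stand-in that fails on its first two calls, then succeeds
calls <- 0
fake_print <- function(url) {
  calls <<- calls + 1
  if (calls <= 2) stop("Cannot connect R to Chrome. Please retry.")
  "ok.pdf"
}
remaining <- retry_failed(c("a", "b", "c"), failed = c(1L, 3L),
                          print_fun = fake_print)
```

With the real data, the call would look something like `retry_failed(paste0("https://stackoverflow.com", link_questions), failed = bad)`, where `bad` holds the failed indices converted with as.integer(). Capping the rounds avoids looping forever on a link that can never be printed.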

DJV
  • Great answer. Just a side note: if you want to grepl for the PDFs, I would escape the dot, e.g. pattern = "\\.pdf$" (the $ makes the regex look at the end of the string), or use the fixed = TRUE parameter. – GWD Dec 20 '19 at 10:00
  • Thank you for your comment. I should definitely work on my regex skills! – DJV Dec 21 '19 at 16:40

Based on your example, it looks like you have two errors to contend with. The first error is the one you mention in your question. It is also the most frequent error:

Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.

The second error arises when there are links in the HTML that return 404:

Failed to generate output. Reason: Failed to open https://lh3.googleusercontent.com/-bwcos_zylKg/AAAAAAAAAAI/AAAAAAAAAAA/AAnnY7o18NuEdWnDEck_qPpn-lu21VTdfw/mo/photo.jpg?sz=32 (HTTP status code: 404)

The key phrase in the first error is "Please retry". As far as I can tell, chrome_print sometimes has issues connecting to Chrome. It seems to be fairly random, i.e. failed connections in one run will be fine in the next, and vice versa. The easiest way to get around this issue is to just keep trying until it connects.

I can't come up with any fix for the second error. However, it doesn't seem to come up very often, so it might make sense to just record it and skip to the next URL.

Using the following code I'm able to print 48 of 50 pages. The only two I can't get to work have the 404 issue I describe above. Note that I use purrr::safely to catch errors. Base R's tryCatch will also work fine, but I find safely a little more convenient. In the end it's really just a matter of preference.
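
For readers who prefer base R, here is a rough sketch of what a safely-style wrapper does: it returns a list with a `result` and an `error` element, one of which is NULL. `safely_base()` is a hypothetical name used only for illustration; it is not part of purrr or pagedown, and it does not reproduce every option of purrr::safely.

```r
# Hypothetical base R wrapper that mimics the result/error list of a
# safely-style function, built on tryCatch().
safely_base <- function(f) {
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
  }
}

safe_log <- safely_base(log)
good <- safe_log(10)   # good$error is NULL, good$result holds log(10)
bad  <- safe_log("a")  # bad$result is NULL, bad$error holds the condition
```

The error message can then be read with `conditionMessage(bad$error)`, which is the same information the loop below inspects via `output$error$message`.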

Also note that I've dealt with the connection error by utilizing repeat within the for loop. R will keep trying to connect to Chrome and print until it is either successful, or some other error pops up. I didn't need it, but you might want to include a counter to set an upper threshold for the number of connection attempts:

quest_urls <- paste0("https://stackoverflow.com", link_questions)
errors <- NULL

safe_print <- purrr::safely(pagedown::chrome_print)

for (qurl in quest_urls){
    repeat {
        output <- safe_print(qurl)
        if (is.null(output$error)) break
        else if (grepl("retry", output$error$message)) next
        else {errors <- c(errors, setNames(output$error$message, qurl)); break}
    }
}
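
The counter mentioned above could be sketched as follows. This is an assumption-laden illustration rather than tested production code: `print_with_retries()`, `max_attempts` and the stand-in `always_busy()` are hypothetical names, and the check for "retry" in the message simply mirrors the grepl() call in the loop above.

```r
# Hypothetical helper: try to print one URL, retrying only on the
# "Please retry" connection error, up to max_attempts times.
# Returns NULL on success, otherwise a character error message.
print_with_retries <- function(url, print_fun = pagedown::chrome_print,
                               max_attempts = 5) {
  for (attempt in seq_len(max_attempts)) {
    err <- tryCatch({
      print_fun(url)
      NULL
    }, error = function(e) e)
    if (is.null(err)) return(NULL)            # printed successfully
    if (!grepl("retry", conditionMessage(err))) {
      return(conditionMessage(err))           # e.g. a 404: do not retry
    }
  }
  paste("Gave up after", max_attempts, "attempts:", conditionMessage(err))
}

# Demo with a stand-in that always raises the connection error
always_busy <- function(url) stop("Cannot connect R to Chrome. Please retry.")
msg <- print_with_retries("https://example.com", print_fun = always_busy,
                          max_attempts = 3)
```

Inside the for loop, the returned message (when not NULL) can be stored under the URL's name in `errors`, just as the repeat version does.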
  • Gladly! I've also gotten into contact with the developers. You can follow the issue [here](https://github.com/rstudio/pagedown/issues/158) if you're interested. –  Dec 21 '19 at 02:54