How to properly close connections so I won't get "Error in file(con, "r") : all connections are in use" when using readLines and tryCatch

Question

I have a list of more than 4000 URLs from a specific domain (pixilink.com), and I want to figure out whether each URL points to a picture or a video. To do this, I used the solutions provided in "How to write trycatch in R" and "Check whether a website provides photo or video based on a pattern in its URL", and wrote the code shown below:

#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  mycontent <- readLines(x)
  mypos <- grep("initial_mode = ", mycontent)
  
  if(grepl("0", mycontent[mypos])){
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    return("video")
  } else{
    return(NA)
  }
}

Also, to avoid errors for URLs that don't exist, I used the code below:

readUrl <- function(url) {
  out <- tryCatch(
    {
      readLines(con=url, warn=FALSE)
      return(1)    
    },
    error=function(cond) {
      return(NA)
    },
    warning=function(cond) {    
      return(NA)
    },
    finally={
      message( url)
    }
  )    
  return(out)
}

Finally, I took a subset of the URL list and passed it to the functions described above (here, for instance, I used the first 1000 URLs):

a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows

tt <- numeric(length(vec)) # checking validity of url
for (i in 1:length(vec)){
  tt[i] <- readUrl(vec[i])
  print(i)
}    
g <- data.frame(vec,tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url

dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
  dd[j] <- urlmode(g2[j,1])      
}    
Final <- cbind(g2,dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))

I ran this code on a sample of 100 URLs and it worked; however, when I ran it on the whole list of URLs, it returned an error. Here is the error: Error in textConnection("rval", "w", local = TRUE) : all connections are in use Error in textConnection("rval", "w", local = TRUE) : all connections are in use

After this, even when I re-ran the code on the sample URLs (the 100 I had tested before), I got this error message: Error in file(con, "r") : all connections are in use

I also tried calling closeAllConnections() after each function call in the loop, but it didn't work. Can anyone explain what this error is about? Is it related to the number of requests we can make to the website? What's the solution?

@IRTFM, I've tried that using `closeAllConnections()` and still got the same error message — Ross_you, Oct 30 '20 at 17:58

Dunois · Accepted Answer · 2020-10-31T13:38:03.977

My guess as to why this is happening is that you're not closing the connections that you open via tryCatch() and urlmode() through the use of readLines(). I was unsure how urlmode() was going to be used in your previous post, so I had made it as simple as I could (in hindsight, that was badly done; my apologies). So I took the liberty of rewriting urlmode() to make it a little more robust for what appears to be a more expansive task at hand.

I think the comments in the code should help, so take a look below:

#Updated URL mode function with better 
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
  
  #Check if URL is good to go
  if(!httr::http_error(x)){
    
    #Test cases
    #x <- "www.pixilink.com/3"
    #x <- "https://www.pixilink.com/93320"
    #x <- "https://www.pixilink.com/93313"
    
    #Then since there are redirect shenanigans
    #Get the actual URL the input points to
    #It should just be the input URL if there is
    #no redirection
    #This is important as this also takes care of
    #checking whether http or https need to be prefixed
    #in case the input URL is supplied without those
    #(this can cause problems for url() below)
    myx <- httr::HEAD(x)$url
    
    #Then check for what the default mode is
    mycon <- url(myx)
    open(mycon, "r")
    mycontent <- readLines(mycon)
    
    mypos <- grep("initial_mode = ", mycontent)
    
    #Close the connection since it's no longer
    #necessary
    close(mycon)
    
    #Some URLs with weird formats can return 
    #empty on this one since they don't
    #follow the expected format.
    #See for example: "https://www.pixilink.com/clients/899/#3"
    #which is actually
    #redirected from "https://www.pixilink.com/3"
    #After that, evaluate what's at mypos, and always 
    #return the actual URL
    #along with the result
    if(!purrr::is_empty(mypos)){
      
      #mystr<- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
      mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
      return(c(myx, mystr))
      #return(mystr)
      
      #So once all that is done, check if the line at mypos
      #contains a 0 (picture), tour (video)
      #if(grepl("0", mycontent[mypos])){
      #  return(c(myx, "picture"))
        #return("picture")
      #} else if(grepl("tour", mycontent[mypos])){
      #  return(c(myx, "video"))
        #return("video")
      #}
      
    } else{
      #Valid URL but not interpretable
      return(c(myx, "uninterpretable"))
      #return("uninterpretable")
    }
    
  } else{
    #Straight up invalid URL
    #No myx variable to return here
    #Just x
    return(c(x, "invalid"))
    #return("invalid")
  }
  
}


#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)


#All future + progressr related stuff
#learned courtesy 
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)

#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))


#Website's base URL
baseurl <- "https://www.pixilink.com"

#Using future_lapply() to apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar

#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
  myprog <- progressr::progressor(along = range)
  sitetype <- do.call(rbind, future_lapply(range, function(b, x){
    myprog() ##Progress bar signaller
    myurl <- paste0(b, "/", x)
    cat("\n", myurl, " ")
    myret <- urlmode(myurl)
    cat(myret, "\n")
    return(c(myurl, myret))
  }, b = baseurl, future.chunk.size = 10))
  
})




#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")

#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")


head(sitetype)
#                        given_url                     actual_url        mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310     invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311     invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313     picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315        tour

unique(sitetype$mode)
# [1] "invalid"     "floorplan2d" "picture"     "tour" 

#--------
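
As a rough offline illustration of the extraction step in urlmode() above, the sketch below applies the same quote-delimited pattern in base R instead of stringr. The helper name extract_mode() and the sample page lines are hypothetical; the real pixilink.com markup may look different, and the only thing taken from the code above is that a line contains "initial_mode = " followed by a quoted value.

```r
# Hypothetical helper: pull the quoted value assigned to initial_mode
# out of the lines of a page. Returns NA if no usable line is found.
extract_mode <- function(lines) {
  hits <- grep("initial_mode = ", lines, fixed = TRUE, value = TRUE)
  if (length(hits) == 0) return(NA_character_)
  # Same idea as the stringr pattern above: everything between the
  # first and the last single quote on the line.
  m <- regexpr("(?<=').*(?=')", hits[1], perl = TRUE)
  if (m == -1) return(NA_character_)
  regmatches(hits[1], m)
}

# Made-up sample pages (assumed format, not copied from the site)
page_tour    <- c("<script>", "var initial_mode = 'tour';", "</script>")
page_picture <- c("var initial_mode = '0';")
page_other   <- c("<html></html>")

extract_mode(page_tour)
extract_mode(page_picture)
extract_mode(page_other)
```

Returning NA for pages without the expected line mirrors the "uninterpretable" branch of urlmode().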

Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity and URL redirection, and also "intelligently" extracts the value assigned to initial_mode. With the help of future_lapply() (from the future.apply package) and the progress bar from the progressr package, this can now be applied quite conveniently in parallel to as many pixilink.com/<integer> URLs as desired. With a bit of wrangling thereafter, the results can be presented very tidily as a data.frame as shown.
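
One possible hardening of the open/read/close sequence is to register the close() call with on.exit(), so that the connection is released even if open() or readLines() fails partway through. This is only a sketch: read_lines_safely() is a hypothetical helper, not part of the code above, and it accepts a local file path as well as a URL so that it can be tried without network access.

```r
# Hypothetical helper: read all lines from a URL or a local file path and
# make sure the connection is closed whether or not reading succeeds.
read_lines_safely <- function(target) {
  con <- if (grepl("^https?://", target)) url(target) else file(target)
  # Registered right after the connection is created, so it runs on
  # normal return and on error alike. close() also destroys the connection.
  on.exit(close(con), add = TRUE)
  tryCatch({
    open(con, "r")
    readLines(con, warn = FALSE)
  }, error = function(e) NA_character_)
}

# Offline usage with a temporary file standing in for a web page
tmp <- tempfile(fileext = ".txt")
writeLines(c("var initial_mode = 'tour';", "other"), tmp)
good <- read_lines_safely(tmp)

# A path that does not exist: the error is caught and NA is returned
bad <- suppressWarnings(
  read_lines_safely(file.path(tempdir(), "does-not-exist.txt"))
)
```

The design choice here is that the cleanup lives next to the creation of the connection, so no later branch or early return can skip it.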

As an example, I've demonstrated this for a small range in the code above. Note the commented-out 1:10000 range in the code in this context: I let this code run for the last couple of hours over this (hopefully sufficiently) large range of URLs to check for errors and problems. I can attest that I encountered no errors (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I don't fancy uploading that to pastebin or elsewhere unnecessarily). Due to an oversight on my part, I forgot to benchmark that run, but I suppose I could do that later if performance metrics are desired or would be considered interesting.
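
A simple way to check for leaked connections during a long run like this is to count the allocated connections before and after a piece of work with showConnections(). By default, R allows only a limited number of connections to be allocated at once (128 according to ?connections, three of which are reserved for stdin, stdout and stderr), which is the limit the "all connections are in use" error refers to. The sketch below uses a local temporary file rather than a URL; count_connections() is a hypothetical helper name.

```r
# Hypothetical helper: number of user-created connections currently
# allocated (opened or not). The three standard connections are excluded.
count_connections <- function() nrow(showConnections())

before <- count_connections()

# A connection that is created but never closed keeps occupying a slot
leaky <- file(tempfile())
during <- count_connections()

# Closing it releases the slot again
close(leaky)
after <- count_connections()

c(before = before, during = during, after = after)
```

If the count keeps growing from one loop iteration to the next, some code path is creating connections without closing them.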

For your purposes, I believe you can simply take the entire code snippet above and run it verbatim (or with modifications) by just changing the range assignment right before the with_progress(do.call(...)) step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch() messes to deal with).

Thank you so much Dunois for all the effort you put into this. I have a couple of questions, but I'd rather start with the error message I got. In order to understand your code completely, I started with the example "www.pixilink.com/3". The problem was that when I ran `open(mycon, "r")` for this link, I got the error message `Error in open.connection(mycon, "r") : cannot open the connection`. I haven't got this error for the other examples, though. Then I skipped this and ran the code on my URL list, and again on the 232nd URL, which was `https://www.pixilink.com/141451#mode=tour`, I got the same — Ross_you, Nov 02 '20 at 18:27
the error message which said `Error in open.connection(mycon, "r") : cannot open the connection`. This link actually exists and I am not sure why this happens. Any comment on this? I can definitely skip these variables for now, but I just wanted to fully understand the code — Ross_you, Nov 02 '20 at 18:29
Once again I want to say thanks for the effort and I understand you should have the complete dataset to fix all issues. I shared it here so you can check and have a better understanding of what's going on in my dataset. Hopefully, you can help me fix this issue — Ross_you, Nov 02 '20 at 18:41
Another example of this type of error is this link `https://www.pixilink.com/140079#mode=tour`, and when I run it I get an error like this : `Error in open.connection(mycon, "r") : cannot open the connection In addition: Warning message: In open.connection(mycon, "r") : cannot open URL 'https://www.pixilink.com/140079#mode=tour': HTTP status was '400 Bad Request'` This link also exists and I am not sure why R cannot establish a connection with the link. — Ross_you, Nov 02 '20 at 19:22
I dug more into it and found a few more example URLs in my list where I get the same error message. Here is a list of other examples: `bad_URL <- c("https://www.pixilink.com/141451#mode=tour", "https://www.pixilink.com/128900#mode=tour", "https://www.pixilink.com/124229#mode=tour", "https://www.pixilink.com/138619#mode=0")` I understand that without having my dataset, there is no way for you to figure out these issues; however, my question is a general one: why can't R establish a connection even though these links are valid? Is there any way to skip this error? — Ross_you, Nov 02 '20 at 20:48
This is really strange. I just tried out those URLs and all of them work just fine. I don't think it is the function itself that is causing the errors. I presume this function is being used in a script with other components in it? Are any of those components opening connections as well? — Dunois, Nov 03 '20 at 09:39
No I ran exactly the same function on the test URLs you provided. Yes, it's strange! I asked this on StackOverflow and someone else mentioned that he ran it on Linux and he got no error! Are you also using `Linux`? I am using Windows 10 and maybe that's the reason. In order to make the code work for me, I had to separate all of these links which result in a connection error (47 links) and then the code ran perfectly for me — Ross_you, Nov 03 '20 at 17:18
Yep, I'm on Linux. That definitely might have something to do with this then. I didn't think the OS would influence anything here; appears I was wrong. Hmm what do you mean by you had to "separate all these links"? How were you feeding the links to the function earlier (as opposed to now)? Good on you for investigating this thoroughly!! — Dunois, Nov 04 '20 at 10:23
@Roozbeh_you I just checked out your other post, and I concur that it could be because of the fragments, although that said, I must confess when I tried the links with the fragments in there, they still worked fine. To be honest, I hadn't really thought about this, because I was operating under the assumption that it is the default fragment itself that you were trying to identify (and that the URL being passed to `urlmode()` would thus not contain a fragment). I think we can circumvent this easily by adding a line to strip all text following the domain name. — Dunois, Nov 04 '20 at 10:31
Exactly, I did the same thing and used `gsub` to remove the fragment, then ran your function after that. It worked perfectly, thanks so much for your help and clear solution — Ross_you, Nov 04 '20 at 18:37
You're welcome!! I'm glad I was able to help. This was a very interesting little project. — Dunois, Nov 05 '20 at 15:03
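
The fragment-stripping step discussed in the comments above could look roughly like the sketch below. The exact gsub() call that was used is not shown in the thread, so the pattern and the helper name strip_fragment() are assumptions: the helper removes the first "#" and everything after it before the URL is passed on to urlmode().

```r
# Hypothetical helper: drop the URL fragment ("#..." suffix), if any.
# sub() is enough here because only the first "#" onwards is removed.
strip_fragment <- function(urls) sub("#.*$", "", urls)

bad_urls <- c(
  "https://www.pixilink.com/141451#mode=tour",
  "https://www.pixilink.com/138619#mode=0",
  "https://www.pixilink.com/93313"
)

clean_urls <- strip_fragment(bad_urls)
clean_urls
# [1] "https://www.pixilink.com/141451" "https://www.pixilink.com/138619"
# [3] "https://www.pixilink.com/93313"
```

URLs without a fragment pass through unchanged, so the helper can be applied to the whole URL vector before the loop.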
