So, my guess as to why this is happening is because you're not closing the connections that you're opening via tryCatch()
and via urlmode()
through the use of readLines()
. I was unsure of how urlmode()
was going to be used in your previous post so it had made it as simple as I could (and in hindsight, that was badly done, my apologies). So I took the liberty of rewriting urlmode()
to try and make it a little bit more robust for what appears to be a more expansive task at hand.
I think the comments in the code should help, so take a look below:
#Updated URL mode function with better
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
#Check if URL is good to go
if(!httr::http_error(x)){
#Test cases
#x <- "www.pixilink.com/3"
#x <- "https://www.pixilink.com/93320"
#x <- "https://www.pixilink.com/93313"
#Then since there are redirect shenanigans
#Get the actual URL the input points to
#It should just be the input URL if there is
#no redirection
#This is important as this also takes care of
#checking whether http or https need to be prefixed
#in case the input URL is supplied without those
#(this can cause problems for url() below)
myx <- httr::HEAD(x)$url
#Then check for what the default mode is
mycon <- url(myx)
open(mycon, "r")
mycontent <- readLines(mycon)
mypos <- grep("initial_mode = ", mycontent)
#Close the connection since it's no longer
#necessary
close(mycon)
#Some URLs with weird formats can return
#empty on this one since they don't
#follow the expected format.
#See for example: "https://www.pixilink.com/clients/899/#3"
#which is actually
#redirected from "https://www.pixilink.com/3"
#After that, evaluate what's at mypos, and always
#return the actual URL
#along with the result
if(!purrr::is_empty(mypos)){
#mystr<- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
return(c(myx, mystr))
#return(mystr)
#So once all that is done, check if the line at mypos
#contains a 0 (picture), tour (video)
#if(grepl("0", mycontent[mypos])){
# return(c(myx, "picture"))
#return("picture")
#} else if(grepl("tour", mycontent[mypos])){
# return(c(myx, "video"))
#return("video")
#}
} else{
#Valid URL but not interpretable
return(c(myx, "uninterpretable"))
#return("uninterpretable")
}
} else{
#Straight up invalid URL
#No myx variable to return here
#Just x
return(c(x, "invalid"))
#return("invalid")
}
}
#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)
#All future + progressr related stuff
#learned courtesy
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)
#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))
#Website's base URL
baseurl <- "https://www.pixilink.com"
#Using future_lapply() to recursively apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar
#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
myprog <- progressr::progressor(along = range)
sitetype <- do.call(rbind, future_lapply(range, function(b, x){
myprog() ##Progress bar signaller
myurl <- paste0(b, "/", x)
cat("\n", myurl, " ")
myret <- urlmode(myurl)
cat(myret, "\n")
return(c(myurl, myret))
}, b = baseurl, future.chunk.size = 10))
})
#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")
#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")
head(sitetype)
# given_url actual_url mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310 invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311 invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313 picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315 tour
unique(sitetype$mode)
# [1] "invalid" "floorplan2d" "picture" "tour"
#--------
Basically, urlmode()
now opens and closes connections only when necessary, checks for URL validity, URL redirection, and also "intelligently" extracts the value assigned to initial_mode
. With the help of future.lapply()
, and the progress bar from the progressr
package, this can now be applied quite conveniently in parallel to as many pixilink.com/<integer>
URLs as desired. With a bit of wrangling thereafter, the results can be presented very tidily as a data.frame
as shown.
As an example, I've demonstrated this for a small range in the code above. Note the commented out 1:10000
range in the code in this context: I let this code run the last couple of hours over this (hopefully sufficiently) large range of URLs to check for errors and problems. I can attest that I encountered no errors (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'
). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I don't fancy uploading that to pastebin or elsewhere unnecessarily). Due to oversight on my part, I forgot to benchmark that run, but I suppose I could do that later if performance metrics are desired/would be considered interesting.
For your purposes, I believe you can simply take the entire code snippet below and run it verbatim (or with modifications) by just changing the range
assignment right before the with_progress(do.call(...))
step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch()
messes to deal with).