I have a project that downloads ~20 million PDFs, multithreaded, on an EC2 instance. I'm most proficient in R and this is a one-off, so my initial assessment was that the time saved by bash scripting wouldn't justify the time spent on the learning curve. I decided to just call curl from within an R script. The instance is a c4.8xlarge running RStudio Server on Ubuntu, with 36 cores and 60 GB of memory.
With every method I've tried, memory usage climbs to the maximum fairly quickly. The script runs all right, but I'm concerned that swapping is slowing it down. curl_download and curl_fetch_disk are much faster than the native download.file function (roughly one PDF every 0.05 seconds versus 0.2 seconds), but both hit the memory limit extremely quickly and then seem to populate the directory with empty files. With the native function I dealt with the memory problem by suppressing output with copious use of try() and invisible(). That doesn't seem to help at all with the curl package.
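To put a number on the empty-file symptom, zero-byte files in the download directory can be counted with base R. This is only a minimal sketch and not part of the script below: `count_empty_files`, the `.pdf` pattern, and the temporary demo directory are hypothetical names chosen for illustration.

```r
# Hypothetical helper (not from the original script): count zero-byte files
# in a download directory using only base R.
count_empty_files <- function(dir, pattern = "\\.pdf$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  sizes <- file.size(files)
  is_empty <- !is.na(sizes) & sizes == 0
  list(
    total = length(files),        # number of matching files found
    empty = sum(is_empty),        # how many of them are zero bytes
    empty_files = files[is_empty] # their full paths
  )
}

# Self-contained demonstration in a fresh temporary directory.
demo_dir <- tempfile("pdf_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, "empty.pdf"))          # zero-byte file
writeLines("%PDF-1.4", file.path(demo_dir, "ok.pdf"))  # non-empty file

res <- count_empty_files(demo_dir)
print(res$total)
print(res$empty)
```

Running this against the real output directory after a short sweep would show whether the empty files start appearing only once memory is exhausted.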
I have three related questions, if anyone could help me with them.
(1) Is my understanding correct that needlessly swapping memory would cause the script to slow down?
(2) curl_fetch_disk is supposed to write directly to disk. Does anyone have any idea why it would be using so much memory?
(3) Is there a good way to do this in R, or am I better off learning some bash scripting?
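Current method with curl_download:
library(curl)

# Download one URL to `filename`; try() keeps a single failure from stopping the sweep.
getfile_sweep.fun <- function(url, filename) {
    invisible(
        try(
            curl_download(url, destfile = filename, quiet = TRUE)
        )
    )
}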
Previous method with native download.file:
# Same wrapper, using base R's download.file with the curl method.
getfile_sweep.fun <- function(url, filename) {
    invisible(
        try(
            download.file(url, destfile = filename, quiet = TRUE, method = "curl")
        )
    )
}
parLapply loop:
library(parallel)

len <- nrow(url_sweep.df)

# Indices at which a worker calls gc(): 36 consecutive indices
# (100-135, 1100-1135, ...) in every block of 1000.
gc.vec <- unlist(lapply(0:35, function(x) x + seq(from = 100, to = len, by = 1000)))
gc.vec <- sort(gc.vec)

start.time <- Sys.time()
ptm <- proc.time()

# Forked workers inherit url_sweep.df and getfile_sweep.fun from the parent process.
cl <- makeCluster(detectCores() - 1, type = "FORK")
invisible(
    parLapply(cl, 1:len, function(x) {
        invisible(
            try(
                getfile_sweep.fun(
                    url = url_sweep.df[x, "url"],
                    filename = url_sweep.df[x, "filename"]
                )
            )
        )
        if (x %in% gc.vec) {
            gc()
        }
    })
)
stopCluster(cl)

Sweep.time <- proc.time() - ptm
Sample data:
Sample of url_sweep.df: https://www.dropbox.com/s/anldby6tcxjwazc/url_sweep_sample.rds?dl=0
Sample of existing.filenames: https://www.dropbox.com/s/0n0phz4h5925qk6/existing_filenames_sample.rds?dl=0