
I have a project that's downloading ~20 million PDFs multithreaded on an EC2 instance. I'm most proficient in R, and it's a one-off, so my initial assessment was that the time savings from bash scripting wouldn't be enough to justify the time spent on the learning curve. So I decided just to call curl from within an R script. The instance is a c4.8xlarge, RStudio Server on Ubuntu, with 36 cores and 60 GB of memory.

With any method I've tried, it runs up to the maximum RAM fairly quickly. It runs alright, but I'm concerned that swapping memory is slowing it down. curl_download and curl_fetch_disk work much more quickly than the native download.file function (one PDF every 0.05 seconds versus 0.2), but both run up to max memory extremely quickly and then seem to populate the directory with empty files. With the native function I was dealing with the memory problem by suppressing output with copious usage of try() and invisible(). That doesn't seem to help at all with the curl package.

I have three related questions if anyone could help me with them.

(1) Is my understanding of how memory is utilized correct in that needlessly swapping memory would cause the script to slow down?

(2) curl_fetch_disk is supposed to be writing direct to disk, does anyone have any idea as to why it would be using so much memory?

(3) Is there any good way to do this in R or am I just better off learning some bash scripting?

Current method with curl_download

library(curl)

# Download a single PDF straight to disk; try() keeps one bad URL from
# stopping the whole sweep, and invisible() suppresses the return value.
getfile_sweep.fun <- function(url, filename){
  invisible(
    try(
      curl_download(url, destfile = filename, quiet = TRUE)
    )
  )
}

Previous method with native download.file

# Same wrapper using base R's download.file, shelling out to curl.
getfile_sweep.fun <- function(url, filename){
  invisible(
    try(
      download.file(url, destfile = filename, quiet = TRUE, method = "curl")
    )
  )
}

parLapply loop

library(parallel)

len <- nrow(url_sweep.df)

# Indices at which a worker should force a garbage collection
gc.vec <- unlist(lapply(0:35, function(x) x + seq(from = 100, to = len, by = 1000)))
gc.vec <- sort(gc.vec)

start.time <- Sys.time()
ptm <- proc.time()

cl <- makeCluster(detectCores() - 1, type = "FORK")
invisible(
  parLapply(cl, 1:len, function(x){
    invisible(
      try(
        getfile_sweep.fun(
          url      = url_sweep.df[x, "url"],
          filename = url_sweep.df[x, "filename"]
        )
      )
    )
    # Trigger gc() on the worker at the predefined indices
    if(x %in% gc.vec){
      gc()
    }
  })
)
stopCluster(cl)

Sweep.time <- proc.time() - ptm

Sample of data -

Sample of url_sweep.df: https://www.dropbox.com/s/anldby6tcxjwazc/url_sweep_sample.rds?dl=0

Sample of existing.filenames: https://www.dropbox.com/s/0n0phz4h5925qk6/existing_filenames_sample.rds?dl=0

wjb_hwe
  • One thing to remember: writing to the hard drive is one of the slowest parts of the operation in a system. – BlooB Aug 28 '17 at 20:17
  • So you're saying the issue could be that curl is attempting to write to disk faster than the drive is capable of and it's holding it in memory? Each PDF is about 30KB. If I'm getting 20 files per second I need write speeds of 6MB/s. AWS indicates my EC2 and volume should be able to handle that, or am I mistaken? https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html?icmpid=docs_ec2_console – wjb_hwe Aug 28 '17 at 21:10
  • 1- You are correct, but if memory swapping starts to happen then you are not purely working with RAM anymore, and I think R would have a harder and harder time finding available contiguous memory space; I am also looking into this (I am not sure). 2- Have you tried the native download.file with wb mode? – BlooB Aug 28 '17 at 22:13
  • 1 - Wow. That's a really great point about swapping and I/O. That didn't occur to me. 2 - I haven't tried wb mode just yet although that's on my list of things to try. I'll put it at the top for today. I will ultimately be parsing the PDFs to scrape data from them and wasn't sure what the effect of downloading them as binaries would be. Currently they seem to be some form of text and parsing them is extremely fast. – wjb_hwe Aug 29 '17 at 14:49
  • Try limiting the memory usage; maybe that would help avoid swapping. – BlooB Aug 29 '17 at 19:50
  • Another thing: if you use RCurl, you can use CFILE, which would give you the ability to keep working in R while having C-level file handling capabilities. – BlooB Aug 29 '17 at 19:54
  • Oh, and one other thing: from what I've heard, when downloading to disk or using curl_fetch_* you must check manually that each request completed; on its own it will not do anything to ensure the requests finished (see the sketch after these comments). – BlooB Aug 29 '17 at 20:09
  • Upped the EBS volume to provisioned SSD with 6000 IOPS so writing to disk shouldn't be a bottleneck. Same issue. Now the script is falling over. I'm starting to think it has to do with the fork cluster. It runs up to 60GB of RAM within minutes or seconds. That can't be from PDFs. I also can't reproduce the problem on my desktop (8 cores, 64GB, macOS Sierra). – wjb_hwe Aug 29 '17 at 22:23
  • So when you run this on a small machine everything is ok? – BlooB Aug 29 '17 at 22:44
  • Hey, can you also gather some info on the workload of the CPU cores while your code is running? – BlooB Aug 29 '17 at 22:58
  • Well, quick update. I tinkered A LOT and managed to get it running again without falling over or running out of memory. The big changes were removing all of the try() statements, adding an outer loop so the data frame being used is 500k rows instead of 7m, and using 75% of the cores to leave some headroom for swapping or writing to disk. – wjb_hwe Aug 30 '17 at 15:43
  • The smaller machine works great. It stays at a steady amount of memory even with the try() statements and using all but 1 core. I tend to attribute that to the way the parallel package implements a forking cluster on different operating systems. I'm just not knowledgeable enough to say. – wjb_hwe Aug 30 '17 at 15:44
  • What format would you like the CPU workload data in? AWS management console or something from the ubuntu CLI? If the latter it will probably be easiest if you give me a command to run. I'm only mildly proficient with linux. – wjb_hwe Aug 30 '17 at 15:45
  • I also believe the problem is with the way parallel execution is set up. Look at this page: either use something like the top command, or use the bash script at the bottom with 3 votes. https://stackoverflow.com/questions/3342889/how-do-i-measure-separate-cpu-core-usage-for-a-process?rq=1 – BlooB Aug 30 '17 at 18:27
  • From what I hear, the AWS management console is pretty accurate as well. – BlooB Aug 30 '17 at 18:29
  • Yeah, I'm more and more convinced it's how the parallel usage is set up. New to posting on Stack Exchange so it's going to take me a sec to find a way to put up the CPU usage. Basically it's very low until the loop ends or the memory gets maxed out, and then it spikes. I've managed to get the script running again but I still feel like it could be working faster. I'll post updated code shortly. – wjb_hwe Aug 30 '17 at 21:53
  • One thing to consider is this: in general, running jobs in parallel incurs overhead, and the more cores you have, the more you will see the effects. When you pass a lot of jobs that each take very little time (think less than a second), this results in increased overhead from constantly pushing jobs. Try limiting the cores to 8, just like your desktop, and run your code: does it run fine? If yes, then increase the workload as you increase the cores available to the program. – BlooB Aug 31 '17 at 00:16
  • Yeah, one thing I've done is hold a number of cores in reserve. When I make the cluster it's: cl <- makeCluster(round(detectCores()*.75), type="FORK"). That seems to help. Still having some odd issues, though. Would you like to put some of what you've suggested up as an answer so when we get it figured out I can make it the top answer and give you the points? I really appreciate the help. – wjb_hwe Sep 01 '17 at 17:56
  • Also, Stack Exchange keeps suggesting starting a direct conversation to avoid extended comments, but I'm too new to do so. Could you initiate one? – wjb_hwe Sep 01 '17 at 17:57
  • Hmmm, as far as I know they never did implement anything in the area of direct or private conversation; I will check, though. – BlooB Sep 02 '17 at 17:09
  • Did a link appear suggesting to move to chat? – BlooB Sep 02 '17 at 17:46
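
The completeness check mentioned a few comments up could be as simple as the sketch below. It reuses url_sweep.df from the question; the check_sweep.fun and retry.df names are just illustrative. It flags any file that is missing or zero bytes (the "empty files" symptom from the question) so those rows can be re-run on a second pass:

# Sketch of a manual completeness check (not from the original post):
# flag any PDF that is missing or zero bytes so it can be re-downloaded.
check_sweep.fun <- function(url_sweep.df){
  sizes <- file.size(url_sweep.df$filename)   # NA when the file was never written
  bad   <- is.na(sizes) | sizes == 0          # missing or empty downloads
  url_sweep.df[bad, , drop = FALSE]           # rows to retry on a second pass
}

retry.df <- check_sweep.fun(url_sweep.df)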

2 Answers


Notes:

1- I do not have such a powerful system available to me, so I cannot reproduce every issue mentioned.

2- All the comments are summarized here.

3- It was stated that the machine received an upgrade (the EBS volume to provisioned SSD with 6000 IOPS); however, the issue persists.

Possible issues:

A- If memory swapping starts to happen, then you are not purely working with RAM anymore, and I think R would have a harder and harder time finding available contiguous memory space.

B- The workload and the time it takes to finish, compared to the number of cores.

C- The parallel setup and the fork cluster.

Possible solutions and troubleshooting:

B- Limiting memory usage

C- Limiting number of cores

D- If the code runs fine on a smaller machine like a personal desktop, then the issue is with how the parallel usage is set up, or something with the fork cluster.

Things to still try:

A- In general, running jobs in parallel incurs overhead, and the more cores you have, the more you will see the effects. When you pass a lot of jobs that each take very little time (think less than a second), this results in increased overhead from constantly pushing jobs. Try limiting the cores to 8, just like your desktop, and run your code: does it run fine? If yes, then increase the workload as you increase the cores available to the program.

Start at the lower end of the spectrum for the number of cores and amount of RAM, and increase them as you increase the workload to see where the failure happens.

B- I will post a summary about parallelism in R; this might help you catch something that we have missed.

What worked: Limiting the number of cores fixed the issue. As mentioned by the OP, he also made other changes to the code; however, I do not have access to them.
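
A minimal sketch of what that core-limiting plus chunking setup could look like, reusing url_sweep.df and getfile_sweep.fun from the question (the 0.75 core fraction and the ~500k-row chunks are the values the OP reported; the helper variable names here are just illustrative):

library(parallel)

# Leave roughly a quarter of the cores free for the OS, swapping and disk I/O
n.cores <- max(1, round(detectCores() * 0.75))

# Process the URL list in chunks so each cluster only ever sees ~500k rows
chunk.size   <- 500000
chunk.starts <- seq(1, nrow(url_sweep.df), by = chunk.size)

for(start in chunk.starts){
  chunk.df <- url_sweep.df[start:min(start + chunk.size - 1, nrow(url_sweep.df)), ]
  cl <- makeCluster(n.cores, type = "FORK")
  parLapply(cl, seq_len(nrow(chunk.df)), function(x){
    getfile_sweep.fun(url = chunk.df[x, "url"], filename = chunk.df[x, "filename"])
  })
  stopCluster(cl)   # tearing the cluster down between chunks releases worker memory
}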

BlooB
  • Suggestion C has effectively solved the memory issue. While I have implemented a variety of other changes, I believe limiting the number of cores with makeCluster(round(detectCores()*.7), type="FORK") has left enough resources to handle the overhead, and memory usage remains constant at around 8-9 GB. Although other issues remain, I think they are separate from those described in the original post, and I believe @dirty_feri has solved the problem. I'll include the updated code in the original post shortly. – wjb_hwe Sep 06 '17 at 12:48
  • I am glad to hear things are slowly working out. If the memory problem is resolved, accept my answer with the tick mark and open another post regarding any other issue you might have. – BlooB Sep 06 '17 at 16:56

You can use the async interface instead. Short example below:

# Callback run when a request finishes: derive a file name from the URL path
# and write the response body to disk (requires the curl and urltools packages).
cb_done <- function(resp) {
    filename <- basename(urltools::path(resp$url))
    writeBin(resp$content, filename)
}

pool <- curl::new_pool()
for (u in urls) curl::curl_fetch_multi(u, pool = pool, done = cb_done)
curl::multi_run(pool = pool)
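
If the requests need to be verified, one possible extension (an untested sketch, reusing cb_done, pool and urls from above) is to check the status code in the done callback and pass a fail callback, collecting anything that did not complete for the second mop-up pass mentioned below. The failed_urls vector is an illustrative name, not part of the curl API:

failed_urls <- character(0)                # hypothetical accumulator for a retry pass

for (u in urls) {
  local({
    this_url <- u                          # capture the URL for this request
    curl::curl_fetch_multi(
      this_url,
      pool = pool,
      done = function(resp) {
        if (resp$status_code == 200) {
          cb_done(resp)                    # reuse the callback from the answer above
        } else {
          failed_urls <<- c(failed_urls, this_url)
        }
      },
      # fail is only passed an error message, so the URL comes from the closure
      fail = function(msg) failed_urls <<- c(failed_urls, this_url)
    )
  })
}
curl::multi_run(pool = pool)
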
Artem Klevtsov
  • After reading documentation on curl_fetch_multi, this is an interesting suggestion. The initial download has completed but I'm going to do a second pass in the near future to mop up what was missed (hundreds of thousands or low millions). I'll try this alongside what I wound up using and post the results. – wjb_hwe Oct 16 '17 at 16:56