
In Snakemake I want to avoid running out of memory. In principle this is possible by specifying a memory limit per rule, i.e.:

rule a:
    input:     ...
    output:    ...
    resources:
        mem_mb=100
    shell:
        "..."

I was wondering about best practices for working out sensible values. For the sake of argument, say the input size is constant and independent of threads, so each run is expected to have a constant upper limit.

A potential approach is to benchmark this rule and take values from there (+ some sort of safety margin). An example output of such a benchmark looks like this:

s        h:m:s    max_rss  max_vms   max_uss  max_pss  io_in  io_out  mean_load  cpu_time
92.3651  0:01:32  209.23   15008.50  136.93   148.72   0.05   0.22    302.65     279.61
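
For reference, output like this comes from Snakemake's benchmark directive; a minimal sketch of adding it to the rule above (the benchmark path is just a placeholder):

rule a:
    input:     ...
    output:    ...
    benchmark:
        # writes a tsv with the columns shown above (s, max_rss, max_vms, ...)
        "benchmarks/a.tsv"
    resources:
        mem_mb=100
    shell:
        "..."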

The values are explained in a related thread here, but I was wondering if this is the right approach, and if so, which value to base it on. I.e., if max_vms (maximum “Virtual Memory Size”) were a good proxy, would mem_mb have to be 15008?

1 Answer


Just throwing in some thoughts:

Local constraints are hard. I think the values you get will depend on how snakemake forks jobs, what libraries are loaded, etc. Based on this discussion it seems like rss is a better measure than vms. In either case, I would shoot for something like 60-80% memory utilization, since the benchmark isn't perfect.
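
To make that concrete with the numbers above (a sketch, not a rule from the thread): max_rss was about 209 MB, so requesting roughly 300 MB puts the observed peak at about 70% of the reservation:

rule a:
    input:     ...
    output:    ...
    resources:
        # benchmark reported max_rss ~209 MB; request ~300 MB so the
        # observed peak sits at roughly 70% of the reservation
        mem_mb=300
    shell:
        "..."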

For job schedulers, ultimately it's on the scheduler to decide if you used too much memory. I would plug reportseff for quickly looking at slurm resource usage. You can basically guess, check, and refine until you get reasonable efficiency.

I have attempted to use input file sizes to regress how much time and memory a given job would need, based on slurm-reported resource usage. I found memory was usually O(1) while time was closer to O(n) in input file size for bioinformatics problems that are primarily IO bound. If you want to make an educated guess, you should also record the input file sizes.
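
If you do record input sizes, Snakemake can also derive the resource from them at run time via a callable; a minimal sketch, where the 2x factor and 200 MB floor are made-up placeholders you would fit from your own measurements:

rule a:
    input:     ...
    output:    ...
    resources:
        # scale the request with the total input size (input.size_mb is in MB);
        # the factor and floor here are illustrative, not measured
        mem_mb=lambda wildcards, input: max(200, int(2 * input.size_mb))
    shell:
        "..."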

Finally, you may be prematurely optimizing. If you are doing this locally, it is probably for a small set of jobs. If you ran with -j 1 you could be finished by the time you had worked out your memory usage! Though I'm interested to hear what you find.

Troy Comi
  • Thanks, I had no idea there are so many layers to it. As for the forking part, yes, it probably depends on that, but then the benchmark is also run by snakemake, so how the memory is benchmarked should be consistent with how it is set. And of course the input size, -j, etc. do have a big impact. But in general, say `rss` is the way to go (and all assumptions hold), I'd just plug the `max_rss` value into `mem_mb` (plus some safety margin), right? – Sebastian Müller Feb 16 '22 at 15:45
  • Yeah, add 25-50% on top and probably reserve 20% of your system memory when invoking snakemake so you don't brick your machine. – Troy Comi Feb 19 '22 at 18:37
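
One way to make that reservation when invoking snakemake (a sketch; the 16 GB machine and the numbers are assumptions, not from the thread) is to cap the total mem_mb its jobs may claim with --resources, leaving roughly 20% of RAM for the system:

# on a hypothetical 16 GB machine, let jobs claim at most ~12.8 GB in total
snakemake -j 4 --resources mem_mb=12800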