
We have an embarrassingly parallel problem: we run a large number of instances of a single program, each with a different data set. We do this simply by submitting the application to the batch queue many times, with different parameters each time.

However, with a large number of jobs, not all of them complete. It does not appear to be a problem with the queue itself - all of the jobs are started.

The issue appears to be that with a large number of instances of the application running, lots of jobs finish at roughly the same time and thus all try to write out their data to the parallel file-system at pretty much the same time.

What then seems to happen is that either the program is unable to write to the file-system and crashes in some manner, or it just sits there waiting to write and the batch queue system kills the job after it has been waiting too long. (From what I have gathered about the problem, most of the jobs that fail to complete, if not all, do not leave core files.)

What is the best way to schedule disk-writes to avoid this problem? I mention that our program is embarrassingly parallel to highlight the fact that each process is not aware of the others - they cannot talk to each other to schedule their writes in some manner.

Although I have the source-code for the program, we'd like to solve the problem without modifying it if possible, as we don't maintain or develop it (plus most of the comments are in Italian).

I have had some thoughts on the matter:

  1. Have each job write to the local (scratch) disk of the node at first. We could then run another job which periodically checks which jobs have completed and moves their files from the local disks to the parallel file-system.
  2. Use an MPI wrapper around the program in a master/slave arrangement, where the master manages a queue of jobs and farms them out to the slaves; each slave wrapper runs the application, catches the failure (could I do this reliably for a file-system timeout in C++, or possibly Java?), and sends a message back to the master asking for the job to be re-run. (A rough sketch of such a wrapper follows this list.)
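
To make idea 2 a bit more concrete, a minimal sketch of such a wrapper in C++ with MPI might look like the following. Everything here is an assumption for illustration: the dataset count passed as a command-line argument, the "dataset_<id>.inp" naming, and the "./application" command are all invented. The only real logic is a master rank handing dataset ids to slave ranks, which run the program via system() and report the exit status back so failures can be re-queued (a retry limit would be needed in practice).

    // Sketch only: MPI master/slave wrapper that farms out datasets and
    // re-queues any that fail.
    #include <mpi.h>
    #include <cstdlib>   // std::system, std::atoi
    #include <queue>
    #include <string>

    static const int TAG_WORK = 1, TAG_RESULT = 2, TAG_STOP = 3;

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            // Master: queue of dataset ids; the count is taken from argv[1].
            std::queue<int> pending;
            int ndatasets = (argc > 1) ? std::atoi(argv[1]) : 0;
            for (int i = 0; i < ndatasets; ++i) pending.push(i);

            int active = 0;
            // Hand an initial dataset (or a stop message) to every slave.
            for (int w = 1; w < size; ++w) {
                if (!pending.empty()) {
                    int id = pending.front(); pending.pop();
                    MPI_Send(&id, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                    ++active;
                } else {
                    int dummy = 0;
                    MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                }
            }
            // Collect results, re-queue failures, refill slaves until done.
            while (active > 0) {
                int result[2];   // result[0] = dataset id, result[1] = exit status
                MPI_Status st;
                MPI_Recv(result, 2, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                         MPI_COMM_WORLD, &st);
                --active;
                if (result[1] != 0) pending.push(result[0]);   // failed: try again
                if (!pending.empty()) {
                    int id = pending.front(); pending.pop();
                    MPI_Send(&id, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                    ++active;
                } else {
                    int dummy = 0;
                    MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                }
            }
        } else {
            // Slave: receive dataset ids, run the unmodified application, report back.
            while (true) {
                int id;
                MPI_Status st;
                MPI_Recv(&id, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                // Placeholder command line -- the real invocation will differ.
                std::string cmd = "./application dataset_" + std::to_string(id) + ".inp";
                int result[2] = { id, std::system(cmd.c_str()) };
                MPI_Send(result, 2, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }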

In the meantime I need to pester my supervisors for more information on the error itself - I've never run into it personally, but I haven't had to use the program for a very large number of datasets (yet).

In case it's useful: we run Solaris on our HPC system with the SGE (Sun GridEngine) batch queue system. The file-system is NFS4, and the storage servers also run Solaris. The HPC nodes and storage servers communicate over fibre channel links.

Joseph Earl
  • I think more info on the error is needed. If the app is crashing, then that's clearly more than just an I/O bottleneck. – Oliver Charlesworth Apr 11 '11 at 19:56
  • Sounds familiar. I regularly overload our NFS-server with too many jobs. – ebo Apr 11 '11 at 20:03
  • I can only agree -- I'm the person that's been told to fix the problem! From what I gather most (if not all) of the jobs failing to finish don't leave core files. I'll update tomorrow with more info when I get it. Perhaps I should have asked for good strategies to avoid a bottleneck. – Joseph Earl Apr 11 '11 at 20:08
  • You need to throttle the writing in some way as a start to profiling and diagnosing what is causing the bottleneck. There is always some finite shared resource in every parallel system, I/O always being one of them. –  Apr 11 '11 at 20:28
  • What about using the facilities of the batch queue system? Perhaps start the jobs in staggered groups? Or if the problem is simply that too many jobs are running for some resource, set a limit on the number of simultaneous jobs. Unless you want to randomly try solutions, you need to figure out the cause of the crashes. – M. S. B. Apr 12 '11 at 01:22

3 Answers


Most parallel file systems, particularly those at supercomputing centres, are targeted at HPC applications rather than serial-farm type stuff. As a result, they're painstakingly optimized for bandwidth, not for IOPS (I/O operations per second) - that is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs outputting octillions of tiny files. It is all too easy for users to take something that runs fine(ish) on their desktop, naively scale it up to hundreds of simultaneous jobs, and starve the system of IOPS, hanging their jobs and typically others' on the same system.

The main thing you can do here is aggregate, aggregate, aggregate. It would be best if you could tell us where you're running so we can get more information on the system. But some tried-and-true strategies:

  1. If you are outputting many files per job, change your output strategy so that each job writes out one file which contains all the others. If you have a local ramdisk, you can do something as simple as writing them to the ramdisk, then tar-gzipping them out to the real filesystem.
  2. Write in binary, not in ASCII. Big data never goes in ASCII. Binary formats are ~10x faster to write, somewhat smaller, and you can write big chunks at a time rather than a few numbers in a loop, which leads to:
  3. Big writes are better than little writes. Every I/O operation is something the file system has to do. Make a few big writes rather than looping over tiny ones. (There is a small sketch of this after the list.)
  4. Similarly, don't write in formats which require you to seek around to write in different parts of the file at different times. Seeks are slow and useless.
  5. If you're running many jobs on a node, you can use the same ramdisk trick as above (or local disk) to tar up all the jobs' outputs and send them all out to the parallel file system at once.
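
As a toy illustration of points 2 and 3 (this is not the application's actual output code; the record layout and sizes are invented for the example), compare printing values one at a time with buffering everything in memory and issuing a single large binary write:

    // Sketch: one big binary write instead of many small formatted writes.
    #include <cstdio>
    #include <vector>

    struct Sample { double wavelength; double flux; };   // made-up record layout

    int main() {
        // Accumulate all output in memory first.
        std::vector<Sample> results;
        for (int i = 0; i < 100000; ++i) {
            Sample s;
            s.wavelength = 0.1 * i;
            s.flux = 1.0 / (i + 1);
            results.push_back(s);
        }

        // Slow pattern: one tiny formatted write per record, e.g.
        //   fprintf(f, "%g %g\n", s.wavelength, s.flux);  in a loop.

        // Faster pattern: a single large binary write (one big I/O request).
        FILE *f = std::fopen("results.bin", "wb");
        if (!f) return 1;
        std::fwrite(&results[0], sizeof(Sample), results.size(), f);
        std::fclose(f);
        return 0;
    }

(Self-describing binary formats such as HDF5 or NetCDF give you the same big-write benefit while handling endianness portably.)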

The above suggestions will benefit the I/O performance of your code everywhere, not just on parallel file systems. I/O is slow everywhere, and the more you can do in memory and the fewer actual I/O operations you execute, the faster it will go. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.

Similarly, having fewer big files rather than many small files will speed up everything from directory listings to backups on your filesystem; it is good all around.

Jonathan Dursi
  • Please could you expand on what you mean when you said - "tell us where am I running"? As you guessed, system *is* optimized for HPC work (some jobs we run here use 1000s of cores for over half a year). – Joseph Earl Apr 11 '11 at 20:37
  • Each job reads in 3 files (~ 5kb, 100kb, 15kb), and outputs a couple of files which are typically quite small (~1mb and 5kb). No seeking goes on, it just appends information to the output files. This behaviour itself cannot really be changed. Binary could become an issue - the data may well be transferred to a system with different endianness. Tarring may be a good option. As I've said I can't really change the behaviour of the application itself, so this would have to be in combination with writing the ASCII data to a local disk, then having a job come and tar these up and transfer them. – Joseph Earl Apr 11 '11 at 20:43
  • Yes, another issue is other jobs from other users are running on the same system, and sometimes these don't play so nice. – Joseph Earl Apr 11 '11 at 20:46
  • You can browse our computing resources here if that's what you meant: http://icc.dur.ac.uk/index.php?content=Computing/Computing - although it's missing our latest and greatest system for some reason. – Joseph Earl Apr 11 '11 at 20:47
  • Yeah, I was just wondering what the file system was, blocksize was, things like that. Also, whether you were likely running multiple jobs per node - are you? I didn't realize you were at ICC; is the code something I would recognize? And are you saying the 1MB is ascii? That's not really good, for both the speed and IOPS reasons I suggested. HDF5, or NetCDF, are nice binary formats that do the endian-conversion for you. – Jonathan Dursi Apr 11 '11 at 20:57
  • Yes, the 1MB is ASCII. The code calculates the spectral energy density of a galaxy, taking into account the effects of dust absorption and re-emission, and is called GRASIL. It isn't something we developed in Durham - it was developed by a group in Trieste. You can find info + source code of an old version of the program here http://adlibitum.oat.ts.astro.it/silva/grasil/grasil.html – Joseph Earl Apr 11 '11 at 21:01
  • Oh and yes, with embarrassingly parallel jobs there could be unrelated jobs on the same node. With parallel jobs you can set it so only your processes run on each node. Although this is just how the batch system is set up - I might be able to ask for the job to only be run exclusively on nodes -- I will ask the sysadmin tomorrow. – Joseph Earl Apr 11 '11 at 21:08
  • I see what you mean about not being able to change the behaviour -- looks like they don't distribute source? Bah. If you can take a whole node (presumably if you request a node's worth of cores you'll preferentially get scheduled a whole node) you can use gnu parallel (eg https://support.scinet.utoronto.ca/wiki/index.php/User_Serial#Serial_jobs_of_varying_duration) to run N (or maybe 2-3 N) tasks in one job; then that one job can stage all the inputs to local disk, have everything write outputs to local disk, and then tar that set of results up and send it out to /data all in one go. – Jonathan Dursi Apr 11 '11 at 21:16
  • No, with embarrassingly parallel jobs you don't request a number of cores - you just say I want to run 1000 jobs, and as soon as a single core frees up the batch system will start one of those jobs. So you may end up with 1000 running at once or just a few, depending on what other jobs are running/queued. We do have the source code for the version we are using, but since it's externally developed we'd have to ask them to accept our changes. – Joseph Earl Apr 11 '11 at 21:23
  • http://adlibitum.oat.ts.astro.it/silva/grasil/download.htm - but as I say it is quite old. – Joseph Earl Apr 11 '11 at 21:24
  • Right, but what do they do for _non_ embarrassingly parallel jobs? Presumably you can get multiples of entire nodes that way; and if that's the best way to batch your serial jobs, so be it. We basically require serial job users to do it that way. If it improves file system performance, you'd think the sysadmins would be happy. At the URL provided, where is the source? I've pulled down all the .tgz files and just get data files and a precompiled executable, no source to be found. – Jonathan Dursi Apr 11 '11 at 21:38
  • Aah my mistake, apologies, the source code is *not* public then. Yes, I will ask the sysadmin tomorrow about the behaviour and whether we can change it so there is a limit on the maximum number running at one time. Then I'll look at putting a group of jobs on an entire node, and combining their output into one transfer back to the shared file-system. – Joseph Earl Apr 11 '11 at 21:43

It is hard to decide if you don't know exactly what causes the crash. If you think it is an error related to file-system performance, you could try a distributed filesystem: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html

If you want to implement a master/slave system, maybe Hadoop could be the answer.

But first of all I would try to find out what causes the crash...

Binus

OSes don't always behave nicely when they run out of resources; sometimes they simply abort the process that asks for the first unit of a resource the OS can't provide. Many OSes have file-handle limits (Windows, I think, has a limit of a few thousand handles, which you can bump up against in circumstances like yours), and failure to find a free handle usually means the OS does bad things to the requesting process.

One simple solution, requiring a program change, is to agree that no more than N of your many jobs can be writing at once. You'll need a shared semaphore that all jobs can see; most OSes will provide you with facilities for one, often as a named resource (!). Initialize the semaphore to N before you launch any job. Have each writing job acquire a resource unit from the semaphore when it is about to write, and release that unit when it is done. The amount of code to accomplish this should be a handful of lines inserted once into your highly parallel application. Then you tune N until you no longer have the problem. N==1 will surely solve it, and you can presumably do much better than that.
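
For what it's worth, on a POSIX system such as the Solaris nodes described in the question, a minimal sketch of this idea might use a named semaphore via sem_open(). The semaphore name and the value of N below are placeholders; note also that a POSIX named semaphore only coordinates processes on the same node, so a cluster-wide limit would need some other shared mechanism (or a batch-system resource limit).

    // Sketch only: throttle concurrent writers with a POSIX named semaphore.
    #include <semaphore.h>
    #include <fcntl.h>    // O_CREAT
    #include <cstdio>     // perror

    int main() {
        const char *sem_name = "/write_throttle";  // hypothetical shared name
        const unsigned int N = 8;                  // max concurrent writers (tune this)

        // Open (or create, the first time) the named semaphore initialised to N.
        sem_t *sem = sem_open(sem_name, O_CREAT, 0644, N);
        if (sem == SEM_FAILED) { perror("sem_open"); return 1; }

        sem_wait(sem);   // block until a write slot is free
        // ... perform this job's output to the shared file system here ...
        sem_post(sem);   // release the slot for another job

        sem_close(sem);
        return 0;
    }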

Ira Baxter