Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.
Slurm: A Highly Scalable Resource Manager
Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
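As a rough illustration of how a user asks for an allocation and starts work on it, the sketch below generates a small batch script of the kind normally submitted with sbatch. The JobRequest class and render_batch_script function are hypothetical helpers invented for this example; only the #SBATCH options and the srun command belong to Slurm itself, and the values shown are placeholders.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Hypothetical container for a few common sbatch options."""
    name: str
    nodes: int
    tasks_per_node: int
    time_limit: str  # wall-clock limit, e.g. "00:10:00" (HH:MM:SS)
    command: str


def render_batch_script(job: JobRequest) -> str:
    """Return the text of a batch script that could be passed to sbatch."""
    lines = [
        "#!/bin/bash",
        # Resource request: how many nodes, how many tasks, and for how long
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --ntasks-per-node={job.tasks_per_node}",
        f"#SBATCH --time={job.time_limit}",
        "",
        "# srun launches the command as parallel tasks on the allocated nodes",
        f"srun {job.command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_batch_script(JobRequest("hello", 2, 4, "00:10:00", "hostname")))
```

Saving the generated text to a file and running sbatch on it would place the job in the pending queue until the requested nodes become available.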
Slurm's design is highly modular, with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton); Intel used it on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely on a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.
While other resource managers do exist, Slurm is unique in several respects:
- It is designed to operate in a heterogeneous cluster with over 100,000 nodes and millions of processors.
- It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
- Its source code is freely available under the GNU General Public License.
- It is portable: it is written in C and uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
- It is highly tolerant of system failures, including failure of the node executing its control functions.
- A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
- Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.
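The power-save idea in the last item can be sketched as a toy model. This is not Slurm's implementation: the Node class and nodes_to_suspend function are hypothetical, and the suspend_time threshold is only loosely analogous to the SuspendTime setting in slurm.conf.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a compute node for this toy model."""
    name: str
    idle_seconds: int        # time since the node last ran work (0 if busy)
    powered_on: bool = True


def nodes_to_suspend(nodes, suspend_time):
    """Return names of powered-on nodes idle for at least suspend_time seconds."""
    return [
        n.name
        for n in nodes
        if n.powered_on and n.idle_seconds >= suspend_time
    ]


if __name__ == "__main__":
    cluster = [
        Node("n1", idle_seconds=900),
        Node("n2", idle_seconds=30),
        Node("n3", idle_seconds=1200, powered_on=False),
    ]
    # With a 600-second threshold, only n1 qualifies: n2 has not been idle
    # long enough and n3 is already powered down.
    print(nodes_to_suspend(cluster, suspend_time=600))
```

In an elastic cluster, the same decision could release cloud VMs instead of powering down physical nodes, with new instances started when pending work arrives.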
Resources and Tutorials:
- Quick Start
- Tutorial, LLNL
- Elastic cloud tutorial, Google
- GitHub source
- Bug tracker/patch submission
- Wikipedia
Name Spelling
As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).
Other Uses of the Name
Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.