Overview of Slurm

Slurm is a type of program known as a job scheduler or resource scheduler. On systems shared by many users, it automatically allocates computing resources (such as CPU cores and memory) to each user's jobs.

Slurm (Simple Linux Utility for Resource Management) is a powerful job scheduler designed for Linux, originally developed at Lawrence Livermore National Laboratory (LLNL). It is widely used, particularly in High-Performance Computing (HPC) environments. Initially created to efficiently manage large-scale parallel computing, it has been adopted by many supercomputers and research institutions. Slurm is released as open-source software and is available for free. Additionally, it is often provided as a package for major Linux distributions, making it easy to deploy on research lab servers.

Types of Jobs

Job schedulers generally distinguish the following four types of jobs, and this guide explains Slurm using the same classification.

  • Interactive jobs
    • Used when working on the supercomputer interactively.
  • Batch jobs
    • Used when running a small number of programs that use only one CPU core.
  • Parallel jobs
    • Used when running a small number of programs that use multiple CPU cores simultaneously.
  • Array jobs
    • Used when sequentially running many batch or parallel jobs.
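
The classification above can be sketched as a small Python helper that writes out a job script for each type. This is only an illustration under stated assumptions: the function names and `./my_program` are hypothetical, a parallel job is assumed to be a single multithreaded program, and the `#SBATCH` options shown (`--ntasks`, `--cpus-per-task`, `--array`) should be checked against the official manual and the settings of your system.

```python
def interactive_command(shell="bash"):
    """Return an argv list that asks Slurm for an interactive shell."""
    # Assumption: an interactive job is started with `srun --pty <shell>`.
    return ["srun", "--pty", shell]


def render_job_script(kind, command, cpus=1, array_size=None):
    """Return job script text for a 'batch', 'parallel', or 'array' job."""
    if kind not in ("batch", "parallel", "array"):
        raise ValueError(f"unsupported job type: {kind}")
    if kind == "batch":
        cpus = 1  # a batch job in this guide uses only one CPU core
    lines = [
        "#!/bin/bash",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
    ]
    if kind == "array":
        if not array_size or array_size < 1:
            raise ValueError("an array job needs a positive array_size")
        # Each array task can read its index from SLURM_ARRAY_TASK_ID.
        lines.append(f"#SBATCH --array=1-{array_size}")
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # ./my_program is a placeholder for your own executable.
    print(render_job_script("array", "./my_program", cpus=4, array_size=10))
```

The generated text would normally be saved to a file and submitted with `sbatch`. Real job scripts usually also request memory and a time limit.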

For more details on other types of jobs, please refer to the official manual.

Other Commands

The other commands used most often are as follows:

  • squeue
    • Check the current status of jobs.
  • scancel
    • Cancel (delete) a job.
  • scontrol
    • Change the settings of a job.
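
As a rough sketch of how these three commands are typically called, the following Python helpers build the argument lists and optionally run them. The helper names are hypothetical, and the options shown (`squeue --user`, `scontrol show job`, `scontrol update JobId=... TimeLimit=...`) are common usages that should be confirmed in the official manual. Which settings a regular user may change depends on the site configuration.

```python
import subprocess


def squeue_cmd(user=None):
    """Build a command that lists jobs, optionally only those of one user."""
    cmd = ["squeue"]
    if user:
        cmd.append(f"--user={user}")
    return cmd


def scancel_cmd(job_id):
    """Build a command that cancels the job with the given ID."""
    return ["scancel", str(job_id)]


def scontrol_show_cmd(job_id):
    """Build a command that prints the current settings of a job."""
    return ["scontrol", "show", "job", str(job_id)]


def scontrol_update_cmd(job_id, **settings):
    """Build a command that changes job settings, e.g. TimeLimit=30."""
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return ["scontrol", "update", f"JobId={job_id}", *pairs]


def run(cmd):
    """Run a command on a machine where Slurm is installed; return stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the commands; call run(...) on a Slurm login node instead.
    print(" ".join(squeue_cmd(user="alice")))       # "alice" is a placeholder
    print(" ".join(scontrol_update_cmd(12345, TimeLimit=30)))
```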

For details, please refer to the section on other commands and the official manual.

When a Job Does Not Start

  1. Check the job settings, in particular the following:
    • Make sure the amount of computing resources requested in the job script is correct. Confirm that the request does not exceed the memory available per node or the number of physical CPU cores.
    • Verify that the requested run time does not exceed the time limit set for the partition.
  2. Check the congestion status of the supercomputer.
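
The checks in step 1 can be sketched as a small Python function that compares a request with the limits of a partition. The class, function, and field names are hypothetical and the numbers are made up for illustration. The real limits of your system would have to be looked up, for example with `sinfo` or `scontrol show partition`, and entered by hand.

```python
from dataclasses import dataclass


@dataclass
class PartitionLimits:
    """Limits of one partition (illustrative values only)."""
    cpus_per_node: int      # physical CPU cores per node
    mem_mb_per_node: int    # memory per node in megabytes
    max_minutes: int        # maximum run time in minutes


def check_request(cpus, mem_mb, minutes, limits):
    """Return a list of reasons why a request can never be scheduled."""
    problems = []
    if cpus > limits.cpus_per_node:
        problems.append(
            f"requested {cpus} cores, but a node has {limits.cpus_per_node}")
    if mem_mb > limits.mem_mb_per_node:
        problems.append(
            f"requested {mem_mb} MB, but a node has {limits.mem_mb_per_node} MB")
    if minutes > limits.max_minutes:
        problems.append(
            f"requested {minutes} min, but the limit is {limits.max_minutes} min")
    return problems


if __name__ == "__main__":
    limits = PartitionLimits(cpus_per_node=64, mem_mb_per_node=256000,
                             max_minutes=1440)
    for problem in check_request(128, 32000, 2000, limits):
        print(problem)
```

An empty list only means the request fits the stated limits. The job may still be waiting because the system is busy, which is what step 2 checks.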