
Other Commands

How to Use GPUs

In the personal genome analysis section, the GPU compute nodes are operated so that jobs are allocated on a per-node basis; the environment is not configured to divide the GPUs within a node among multiple jobs. Therefore, when submitting a job that uses the GPUs, there is no need to explicitly request GPU usage with an option.
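
As a minimal sketch (the script contents below are placeholders, not a prescribed template), a batch script for the GPU nodes therefore looks the same as any other Slurm batch script and needs no GPU-related options:

#!/bin/bash
#SBATCH -p parabricks      # partition name used in the examples of this guide
#SBATCH -J gpu_test        # job name (placeholder)
#SBATCH -o res.txt         # file for standard output (standard error goes here too if -e is not given)

# No --gpus or --gres option is needed: the node, including its GPUs,
# is allocated to the job as a whole.
./my_gpu_program           # placeholder for the actual GPU-enabled program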

Checking the Status of Job Execution (squeue)

Checking the Status of Job Submission

The squeue command can be used to check the status of job execution. For details on the options, please refer to the online manual.

Example of execution

xxxxx-pg@at022vm02:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
751 parabrick test.sh xxxxx-pg R 0:02 3 igt[010,015-016]
750 parabrick test.sh xxxxx-pg R 0:05 3 igt[010,015-016]
749 parabrick test.sh xxxxx-pg R 0:09 3 igt[010,015-016]
748 parabrick test.sh xxxxx-pg R 0:13 3 igt[010,015-016]

The default items displayed by squeue are as follows:

Item Name          Description
JOBID              Displays the job ID assigned to the job.
PARTITION          Displays the name of the partition (queue) to which the job was submitted.
NAME               Displays the job name (if not specified, the command string is displayed).
USER               Displays the name of the user who submitted the job.
ST                 Displays the job status. The main job statuses are shown in the table below.
TIME               Job execution time (format: days-hh:mm:ss).
NODES              Number of nodes used for job execution.
NODELIST(REASON)   List of hostnames where the job is running (for a pending job, the reason code is shown instead).

Job Status Description (ST field)

Status Character    Description
CA (CANCELLED)      The job was explicitly cancelled by a user or system administrator.
CD (COMPLETED)      The job has finished executing on all nodes.
CF (CONFIGURING)    The job has been allocated resources and is waiting for them to become ready.
CG (COMPLETING)     The job is in the process of completing.
F (FAILED)          The job terminated with a non-zero exit code or another failure condition.
NF (NODE_FAIL)      The job terminated due to a failure of one of the allocated nodes.
PD (PENDING)        The job is waiting for resource allocation.
PR (PREEMPTED)      The job was terminated due to preemption.
R (RUNNING)         The job is currently running.
S (SUSPENDED)       The job has a resource allocation, but its execution has been suspended.
TO (TIMEOUT)        The job was terminated because it reached its time limit.
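
If you need columns other than the defaults, squeue also accepts a format specification. A sketch (the format string below is only an example; %l adds the time limit and %R the node list or reason code):

# Show only your own jobs, adding a time-limit column to the defaults
squeue -u xxxxx-pg -o "%.8i %.9P %.10j %.10u %.2t %.10M %.10l %.6D %R"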

Checking Detailed Information of a Job (scontrol show job)

If you want to check more detailed information about a job, first confirm its job ID with the squeue command and then run scontrol show job with that ID. For details on the options, please refer to the online manual.

Example of execution

xxxxx-pg@at022vm02:~/$ scontrol show job 747
JobId=747 JobName=test
UserId=xxxxx-pg(30257) GroupId=xxxxxx-pg(30063) MCS_label=N/A
Priority=10102 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:12 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2024-02-19T20:57:53 EligibleTime=2024-02-19T20:57:53
AccrueTime=2024-02-19T20:57:53
StartTime=2024-02-19T20:57:53 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-02-19T20:57:53 Scheduler=Main
Partition=parabricks AllocNode:Sid=at022vm02:1768472
ReqNodeList=(null) ExcNodeList=(null)
NodeList=igt010
BatchHost=igt010
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=375G,node=1,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=4 MinMemoryNode=375G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=./test.sh
WorkDir=/lustre8/home/xxxxx-pg/parabricks
StdErr=/lustre8/home/xxxxx-pg/parabricks/res.txt
StdIn=/dev/null
StdOut=/lustre8/home/xxxxx-pg/parabricks/res.txt
Power=
TresPerNode=gres:gpu:4
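
When only one or two of these fields are needed, the output can be narrowed down on the command line. A small sketch reusing job ID 747 from the example above:

# Print the whole job record on a single line
scontrol -o show job 747
# Pick out only the job state and the allocated resources
scontrol show job 747 | grep -E 'JobState|TRES'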

Checking the Overall Congestion of the Cluster (sinfo, squeue)

To check the status of partitions, use mainly the sinfo command; to check the status of jobs submitted to the partitions, use the squeue command. Please refer to the respective online manuals for more details.

sinfo -s (-s for summary display)

Displays a summary for each partition, without distinguishing by the state of nodes.

yxxxx-pg@at022vm02:~/parabricks$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
igt009 up infinite 0/0/1/1 igt009
igt010 up infinite 1/0/0/1 igt010
igt015 up infinite 0/1/0/1 igt015
igt016 up infinite 0/1/0/1 igt016
parabricks* up infinite 1/2/1/4 igt[009-010,015-016]

The display in the NODES column has the following meanings:

  • A (Allocated): Number of nodes that are allocated (in use).
  • I (Idle): Number of idle (not in use) nodes. Nodes available for use are classified here.
  • O (Other): Number of nodes in other states. This category includes nodes in the drain (unusable for maintenance), down (unusable due to failure), or other specific states recognized by Slurm.
  • T (Total): Total number of nodes in the partition. This is the sum of nodes in Allocated, Idle, and Other states.

Without -s, nodes are displayed separated by their states as follows:

yxxxx-pg@at022vm02:~/parabricks$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
igt009 up infinite 1 down* igt009
igt010 up infinite 1 mix igt010
igt015 up infinite 1 idle igt015
igt016 up infinite 1 idle igt016
parabricks* up infinite 1 down* igt009
parabricks* up infinite 1 mix igt010
parabricks* up infinite 2 idle igt[015-016]

Even when a partition itself is shown as up, the compute nodes it contains may be in states such as down. The mix state indicates that only some of the cores on a compute node are in use.
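
To limit the display to a single partition, sinfo accepts the -p option, and -N switches to a node-oriented listing. A sketch using the parabricks partition from the examples above:

# Summary for one partition only
sinfo -s -p parabricks
# Node-by-node view of the same partition
sinfo -N -p parabricks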

showpartitions

If you want a more condensed summary of partition states, use showpartitions, which is not a standard Slurm command but a third-party open-source tool.

yxxxx-pg@at022vm02:~/parabricks$ showpartitions 
Partition statistics for cluster linux at Thu Feb 22 10:45:21 AM JST 2024
Partition #Nodes #CPU_cores Cores_pending Job_Nodes MaxJobTime Cores Mem/Node
Name State Total Idle Total Idle Resorc Other Min Max Day-hr:mn /node (GB)
igt009 up 1 0 48 0 0 0 1 infin infinite 48 386
igt010 up 1 0 48 44 0 0 1 infin infinite 48 386
igt015 up 1 1 48 48 0 0 1 infin infinite 48 386
igt016 up 1 1 48 48 0 0 1 infin infinite 48 386
parabricks:* up 4 2 192 140 0 0 1 infin infinite 48 386
Note: The cluster default partition name is indicated by :*

The number of nodes currently available in a partition is shown in the Idle sub-column of #Nodes, and the number of available CPU cores in the Idle sub-column of #CPU_cores. Refer to these values to check how busy a partition is.

squeue

To check the status of user jobs submitted to the partitions, use the squeue command. Currently, jobs submitted by different users of the same Slurm system are visible to one another, but if you want to explicitly display job information for all users and all partitions, run squeue with the -a option.

xxxxx-pg@at022vm02:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
646 parabrick test.sh xxxxx-pg R 0:02 3 igt[010,015-016]
647 parabrick test.sh xxxxx-pg R 0:02 3 igt[010,015-016]
645 parabrick test.sh xxxxx-pg R 0:05 3 igt[010,015-016]
648 parabrick test_gpu tkxxx R 0:02 1 igt010
644 parabrick test_gpu tkxxx R 0:12 1 igt010

pestat

If you want to check, from the viewpoint of the compute nodes, how the nodes running jobs are being used, use pestat, an open-source tool. It lists the jobs assigned to each node, together with the number of CPU cores in use and the amount of free memory on the node.

xxxx-pg@at022vm02:~/parabricks$ pestat 
Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
State Use/Tot (15min) (MB) (MB) JobID User ...
igt009 igt009 down* 0 48 0.03 386452 366715
igt010 parabricks mix 4 48 0.33* 386462 377722 844 xxxx-pg
igt015 igt015 idle 0 48 0.01 386458 380861
igt016 igt016 idle 0 48 0.10 386462 379586

How to Check the Reason Why a Job is Not Starting

To see why a job has not started and is still in a waiting state, use the squeue command. The NODELIST(REASON) column displays the reason code.

The full list of reason codes can be found in the squeue documentation on the Slurm developer's site.

If a reason other than waiting for some resource is displayed and the job's execution is being delayed for a long time without any status change, please contact the help desk.
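
As a sketch, the reason code for a specific pending job, and the scheduler's estimated start time, can also be queried directly (the job ID below is a placeholder):

# Show the state (%t) and reason code (%r) for one job
squeue -j 779 -o "%.10i %.2t %r"
# Show the scheduler's estimated start time for pending jobs
squeue --start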

Deleting a Job (scancel)

If the processing is taking longer than expected, or if the intermediate results output to the job's standard output and standard error output indicate that the job is not performing as expected, delete the job with the scancel command.

Example of execution

xxxxx-pg@at022vm02:~/parabricks$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
747 parabrick test xxxxx-pg R 3:01 1 igt010
xxxxx-pg@at022vm02:~/parabricks$ scancel 747
xxxxx-pg@at022vm02:~/parabricks$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
747 parabrick test xxxxx-pg CG 3:06 1 igt010
xxxxx-pg@at022vm02:~/parabricks$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

Immediately after scancel, the job appears in the CG (COMPLETING) state; once completion finishes, it no longer shows up in the squeue output, as in the last squeue run above.
If you want to cancel multiple jobs at once, the following specification methods are available:

scancel option           Description
-u <username>            Cancel jobs by specifying a username.
-p <partition name>      Cancel the scancel user's jobs in the specified partition.
-t <job state>           Cancel jobs by specifying the job state. Specify one of PENDING, RUNNING, or SUSPENDED.

For details on the options, please refer to the online manual.
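
As a sketch of how these options can be combined (partition name taken from the examples in this guide), the following cancels only your own jobs that are still waiting in the parabricks partition:

# Cancel the calling user's PENDING jobs in the parabricks partition
scancel -u $USER -p parabricks -t PENDING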

Changing Job Execution Conditions (scontrol update)

To change the amount of resources requested by a job after it has been submitted in Slurm, use scontrol update.

For more details, refer to the online manual.

Changing the Number of CPUs for a Job

scontrol update JobId=<Job ID> NumCPUs=<new CPU number>


Changing the Memory Amount for a Job

scontrol update JobId=<Job ID> MinMemoryNode=<new memory amount (per node)>

Memory amount is specified in MB units.

Changing the Execution Time for a Job

scontrol update JobId=<Job ID> TimeLimit=<new time limit>

Here, the new time limit is specified in the format days-hours:minutes:seconds, or simply in minutes.

Example of execution

xxxxx-pg@at022vm02:~$ sbatch ./test.sh 
Submitted batch job 779
xxxxx-pg@at022vm02:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
779 parabrick test.sh xxxxx-pg PD 0:00 3 (BeginTime)
xxxxx-pg@at022vm02:~$ scontrol show job 779
JobId=779 JobName=test.sh
UserId=xxxxx-pg(30257) GroupId=co-ddn-pg(30063) MCS_label=N/A
Priority=10150 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=BeginTime Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2024-02-21T11:38:08 EligibleTime=2024-02-27T00:00:00
AccrueTime=2024-02-21T11:38:08
StartTime=2024-02-27T00:00:00 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-02-21T11:38:08 Scheduler=Main
Partition=parabricks AllocNode:Sid=at022vm02:1918834
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-5 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=80G,node=1,billing=20
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=./test.sh
WorkDir=/lustre8/home/xxxxx-pg
StdErr=/lustre8/home/xxxxx-pg/slurm-779.out
StdIn=/dev/null
StdOut=/lustre8/home/xxxxx-pg/slurm-779.out
Power=
SpreadJob=Yes


xxxxx-pg@at022vm02:~$ scontrol update JobID=779 NumCPUs=24
xxxxx-pg@at022vm02:~$ scontrol update JobID=779 MinMemoryNode=20000
xxxxx-pg@at022vm02:~$ scontrol update JobID=779 TimeLimit=100
xxxxx-pg@at022vm02:~$ scontrol show job 779
JobId=779 JobName=test.sh
UserId=xxxxx-pg(30257) GroupId=co-ddn-pg(30063) MCS_label=N/A
Priority=10160 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=BeginTime Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=01:40:00 TimeMin=N/A
SubmitTime=2024-02-21T11:38:08 EligibleTime=2024-02-27T00:00:00
AccrueTime=2024-02-21T11:38:08
StartTime=2024-02-27T00:00:00 EndTime=2024-02-27T01:40:00 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-02-21T11:38:08 Scheduler=Main
Partition=parabricks AllocNode:Sid=at022vm02:1918834
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-5 NumCPUs=24 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,mem=20000M,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=20000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=./test.sh
WorkDir=/lustre8/home/xxxxx-pg
StdErr=/lustre8/home/xxxxx-pg/slurm-779.out
StdIn=/dev/null
StdOut=/lustre8/home/xxxxx-pg/slurm-779.out
Power=
SpreadJob=Yes


xxxxx-pg@at022vm02:~$

Other Resource Changes

You can also change various other resource parameters, such as the number of nodes assigned to a job (NumNodes) or allocation to specific nodes (NodeList):

scontrol update JobId=<Job ID> NumNodes=<new number of nodes>
scontrol update JobId=<Job ID> NodeList=<new node list>

These changes can generally only be made while the job has not yet started (status PENDING). Once a job enters the RUNNING state, changes to its resources are restricted or may be impossible. In addition, system settings or policies may prohibit changes to certain resources.

Checking the Results of Job Execution (sacct)

If no particular command line options are given to sbatch, the job's standard output (and standard error) is written to a file named slurm-<job ID>.out.

The sacct command can be used to check the execution history of a job. For details on the options, please refer to the online manual.

xxxxx-pg@at022vm02:~/parabricks$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
722 bash parabricks 1 COMPLETED 0:0
723 bash parabricks 1 FAILED 2:0
724 bash parabricks 1 COMPLETED 0:0
725 bash parabricks 1 CANCELLED+ 0:0
726 test parabricks 4 COMPLETED 0:0
726.batch batch 4 COMPLETED 0:0
727 test parabricks 4 COMPLETED 0:0
727.batch batch 4 COMPLETED 0:0
728 bash parabricks 1 COMPLETED 0:0
744_1 arraytest+ parabricks 1 COMPLETED 0:0
744_1.batch batch 1 COMPLETED 0:0
744_3 arraytest+ parabricks 1 COMPLETED 0:0
744_3.batch batch 1 COMPLETED 0:0
744_5 arraytest+ parabricks 1 COMPLETED 0:0
744_5.batch batch 1 COMPLETED 0:0
747 test parabricks 4 CANCELLED+ 0:0
747.batch batch 4 CANCELLED 0:15
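
By default, sacct covers only jobs from the current day and shows the fields above. As a sketch, the -S option widens the time window and --format selects additional accounting fields (the field names below are standard sacct fields):

# Jobs since 2024-02-19, with elapsed time and peak memory usage
sacct -S 2024-02-19 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode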

Environment Variables and Other Settings Required to Run Slurm

Normally, users do not need to configure anything themselves; the required settings are applied automatically when you log in to the Slurm login node of the personal genome analysis section.