Site Policy
April 24, 2017

Basic procedures for using the Supercomputer System

Our Supercomputer System comprises the hardware described in the hardware configuration section. So that multiple users can share it efficiently, the system runs the Univa Grid Engine (UGE) job management system. The environment comprises the following components:

Gateway node (gw.ddbj.nig.ac.jp)

This is the node used to connect to the system from the internet. To utilize the system, a user must first connect to this node from outside the system.

Login node

This is the node on which users develop programs and submit jobs to the job management system. Users log in to this node from the gateway node using the qlogin command. There are multiple login nodes in the system, and each user is assigned to the node with the lowest load when they log in.

Compute node

This is the node on which jobs submitted by users and managed by the job management system are executed. There are Thin compute nodes, Medium compute nodes, and Fat compute nodes, according to the type of computer.

Queue

A queue is a UGE concept that groups compute nodes logically. When UGE is instructed to execute a job via a queue, the computation is automatically carried out on a compute node that satisfies the specified conditions. There are multiple types of queues, which are described later; the specific procedures for using them are given in a subsequent section.

A schematic drawing of system utilization that shows how these components are connected is provided below:

[Figure: schematic drawing of system utilization]
The basic steps in the procedure for using the system are as follows:

Connecting to the system

First, connect to the gateway node (gw.ddbj.nig.ac.jp) via ssh. Please prepare ssh client software that runs on the terminal at hand and establish a connection, as shown in the example below; the authentication method is password authentication. Please note that the gateway node is only the entrance to the system from outside: no computations or programs can be executed on it, and no environment for such work is provided. Once you are connected to the gateway node, log in to a login node by following the procedure in the next section.
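
For example, with an OpenSSH client (replace username with your own account name):

    ssh username@gw.ddbj.nig.ac.jp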

Logging in to a login node

Use qlogin, a UGE command, as follows:

 [username@gw ~]$ qlogin
Your job 6896426 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 6896426 has been successfully scheduled.
Establishing /home/geadmin/UGER/utilbin/lx-amd64/qlogin_wrapper session to host t266i

When executed, qlogin checks the load conditions of the login nodes, automatically selects the node with the lowest load, and logs in to it. (In the above example, the login session itself is treated as an interactive job (job ID 6896426), and node t266 is selected for the login.) Executing the qstat command (described later) on the node you are logged in to displays information about the jobs currently under UGE management, as follows:

 [username@gw ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue          slots ja-task-ID
--------------------------------------------------------------------------------------------------
6896426 0.00000 QLOGIN     username     r     06/14/2012 13:27:53 login.q@t261i      1

Various development environments and scripting language environments are preinstalled on the login nodes; please carry out development work there. For information about the programming environments on the login nodes, please refer to programming environments. Do not execute large-scale computations on the login nodes: always run computations on the compute nodes by submitting jobs via UGE.

Login nodes equipped with GPGPU are provided for program development using the GPGPU development environment. To log in to such a node, specify gpu with the -l option when executing qlogin on the gateway node, as follows:

  qlogin -l gpu

This allows you to log in to a login node equipped with GPGPU. The procedure for submitting jobs to the compute nodes is as follows:

Submitting computation jobs

To submit computation jobs to the compute nodes, use the qsub command. Submitting a job with qsub requires a job script. A simple example of such a script is shown below:

#!/bin/sh
#$ -S /bin/sh           # option line for UGE: the interpreter used to run the script
pwd                     # print the job's working directory
hostname                # print the execution host
date                    # timestamp before the work
sleep 20                # stands in for the actual computation
date                    # timestamp after the work
echo "to stderr" 1>&2   # write a message to standard error

Lines beginning with "#$" at the top of the script set options for UGE. The mode of operation is communicated to UGE either by writing such option instruction lines in the shell script or by passing the corresponding options to the qsub command. The major options are as follows:

Script instruction line | Command-line option | Meaning
#$ -S <interpreter path> | -S <interpreter path> | Specifies the path of the command interpreter. Script languages other than shells can also be specified. This option does not have to be specified.
#$ -cwd | -cwd | Uses the current working directory for job execution, so the job's standard output and standard error are written there. If this is not specified, the home directory is used as the working directory for job execution.
#$ -N <job name> | -N <job name> | Specifies the name of the job. The script name is used as the job name if this is not specified.
#$ -o <file path> | -o <file path> | Specifies the output destination for the job's standard output.
#$ -e <file path> | -e <file path> | Specifies the output destination for the job's standard error output.


Many other options besides those listed above can be specified for qsub. For details, run "man qsub" after logging in to the system and consult the online manual.
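
For example, the following invocation (a sketch; myjob, myjob.out, and myjob.err are hypothetical names) submits the script above with the options given on the command line rather than as #$ lines:

    qsub -cwd -N myjob -o myjob.out -e myjob.err test.sh

This is equivalent to adding #$ -cwd, #$ -N myjob, #$ -o myjob.out, and #$ -e myjob.err to the script itself.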

Queue selection when submitting jobs

As described in the software configuration section, this system provides the queues listed below (as of April 2012). Queues are selected with the -l option of the qsub command: to submit a job to a specific queue, execute qsub with one of the "queue specification options" shown in the table. If nothing is specified, the job is submitted to week_hdd.q.

Queue name | Number of job slots | Upper limit for execution time | Purpose | Queue specification option
week_hdd.q | 1600 | 14 days | When there is no particular resource request and the execution period is within two weeks | No specification (default)
week_ssd.q | 832 | 14 days | Used when use of SSD is desired | -l ssd
month_hdd.q | 96 | 62 days | For long-term jobs | -l month
month_ssd.q | 64 | 62 days | For long-term jobs using SSD | -l month -l ssd
month_gpu.q | 992 | 62 days | For jobs using GPU | -l month -l gpu
month_medium.q | 160 | 62 days | For jobs using Medium compute nodes | -l month -l medium
month_fat.q | 768 | 62 days | For jobs using Fat compute nodes | -l month -l fat
debug.q | 64 | 1 day | For debugging and operation checks | -l debug
login.q | 192 | Unlimited | Used to execute qlogin from the gateway node | (not applicable)

For example, to submit a job script called test.sh to month_ssd.q, enter:

    qsub -l month -l ssd test.sh

To submit test.sh to the Medium compute nodes, enter:

    qsub -l month -l medium test.sh

To submit test.sh to the Fat compute nodes, enter:

    qsub -l month -l fat test.sh

Note, however, that nodes equipped with GPGPU also have SSD, and nodes equipped with SSD also have HDD. The system is therefore configured so that jobs overflow into the GPU queue when the SSD queue slots are full, and jobs submitted while the HDD queue is full are placed in the SSD queue. Please keep this in mind.

Checking the status of a submitted job

Whether a job submitted with qsub was actually accepted can be checked using the qstat command, which displays the status of submitted jobs. If a number of jobs have been submitted, for instance, qstat gives output similar to the following:

qstat
job-ID  prior   name       user         state submit/start at     queue            slots ja-task-ID
----------------------------------------------------------------------------------------------------
6929724 0.00000 jobname    username     r     06/18/2012 13:00:37 week_hdd.q@t274i     1
6929726 0.00000 jobname    username     r     06/18/2012 13:00:37 week_hdd.q@t274i     1
6929729 0.00000 jobname    username     r     06/18/2012 13:00:37 week_hdd.q@t287i     1
6929730 0.00000 jobname    username     r     06/18/2012 13:00:37 week_hdd.q@t250i     1

The meanings of the characters in the "state" column are as follows:

Character | Meaning
r | Job being executed
qw | Job waiting in the queue
t | Job being transferred to the execution host
E | An error occurred while processing the job
d | Job in the process of deletion
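
To list only the jobs of a particular user, qstat's standard Grid Engine -u option can be used (assuming the default configuration on this system):

    qstat -u username    # show jobs belonging to username
    qstat -u "*"         # show jobs of all users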

To check the queue utilization status, input "qstat -f". This gives the following output:

[username@t266 pgi]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
debug.q@t139i                  BP    0/0/16         0.44     lx-amd64
---------------------------------------------------------------------------------
debug.q@t253i                  BP    0/0/16         0.49     lx-amd64
---------------------------------------------------------------------------------
debug.q@t254i                  BP    0/0/16         0.43     lx-amd64
---------------------------------------------------------------------------------
debug.q@t255i                  BP    0/0/16         0.44     lx-amd64
---------------------------------------------------------------------------------
debug.q@t256i                  BP    0/0/16         0.41     lx-amd64
(omitted)
---------------------------------------------------------------------------------
week_hdd.q@t267i               BP    0/16/16        16.65    lx-amd64      a
6901296 0.00000 job_name   username     r     06/17/2012 22:20:32    16
---------------------------------------------------------------------------------
week_hdd.q@t268i               BP    0/0/16         0.50     lx-amd64
---------------------------------------------------------------------------------
week_hdd.q@t269i               BP    0/0/16         0.57     lx-amd64
---------------------------------------------------------------------------------
(omitted)
week_hdd.q@t280i               BP    0/6/16         2.18     lx-amd64
---------------------------------------------------------------------------------
week_hdd.q@t281i               BP    0/12/16        12.79    lx-amd64
6901296 0.00000 job_name   username    r     06/17/2012 22:20:32    12
---------------------------------------------------------------------------------
week_hdd.q@t282i               BP    0/6/16         1.34     lx-amd64
---------------------------------------------------------------------------------
week_hdd.q@t283i               BP    0/16/16        16.63    lx-amd64      a
6901296 0.00000 job_name   username     r     06/17/2012 22:20:32    16
---------------------------------------------------------------------------------
week_hdd.q@t284i               BP    0/16/16        16.59    lx-amd64      a
6901296 0.00000 job_name   username     r     06/17/2012 22:20:32    16
---------------------------------------------------------------------------------
(omitted)

This output can be used to determine the node (queue instance) on which a job is running. To view the overall state of each queue, such as job submission status and queue load, "qstat -g c" can be used.

[username@t266 pgi]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
debug.q                           0.03      0      0    128    128      0      0
login.q                           1.00    122      0     70    192      0      0
month_fat.q                       0.47    450      0    318    768      0      0
month_gpu.q                       0.03      0      0    976    992      0     16
month_hdd.q                       0.04      1      0     95     96      0      0
month_medium.q                    0.24     32      0    128    160      0      0
month_ssd.q                       0.03      0      0     32     32      0      0
web_month.q                       0.03      0      0     16     16      0      0
web_week.q                        0.03      0      0     16     16      0      0
week_hdd.q                        0.13    222      0    914   1136     48      0
week_ssd.q                        0.33    256      0    608    864    256      0

Detailed information on a job can be obtained by specifying "qstat -j <job ID>".

[username@t266 pgi]$ qstat -j 6901165
==============================================================
job_number:                 6901165
exec_file:                  job_scripts/6901165
submission_time:            Sun Jun 17 22:12:36 2012
owner:                      username
uid:                        XXXX
group:                      usergroup
gid:                        XXXX
sge_o_home:                 /home/username
sge_o_log_name:             username
sge_o_path:                 /home/geadmin/UGER/bin/lx-amd64:/usr/lib64/qt-3.3/bin:/
opt/pgi/linux86-64/current/bin:/usr/local/pkg/java/current/bin:/opt/intel/composer
_xe_2011_sp1.6.233/bin/intel64:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/
sbin:/sbin:/opt/bin:/opt/intel/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/opt/
intel/itac/8.0.3.007/bin:/home/username/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /lustre1/home/username/workdir
sge_o_host:                 t352i
account:                    sge
cwd:                        /home/username
hard resource_list:         mem_req=4G,s_stack=10240K,s_vmem=4G,ssd=TRUE
mail_list:                  username@t352i
notify:                     FALSE
job_name:                   job_name
jobshare:                   0
shell_list:                 NONE:/bin/sh
env_list:
script_file:                userscript
parallel environment:  mpi-fillup range: 128
binding:                    NONE
usage    1:                 cpu=83:14:55:50, mem=2934667.66727 GBs, io=88.79956,
vmem=57.457G, maxvmem=56.291G
binding    1:               NONE
scheduling info:            (Collecting of scheduler job information is turned off)

If, on checking its execution status, you find that a job is not behaving correctly, you can delete it immediately, without waiting for completion, using the qdel command: specify "qdel <job ID>". To delete all jobs submitted by a specific user, specify "qdel -u <username>".
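
For example, using a job ID from the qstat output above:

    qdel 6929724        # delete one job by job ID
    qdel -u username    # delete all jobs submitted by username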

Checking the results

The results of a job are written to the file <job name>.o<job ID> (standard output) and the file <job name>.e<job ID> (standard error). Please check these files. Detailed information, such as how many resources an executed job used, can be checked with the qacct command.

qacct -j 1996
==============================================================
qname        week_ssd.q
hostname     t046i
group        usergroup
owner        username
project      NONE
department   defaultdepartment
jobname      jobscript.sh
jobnumber    XXXX
taskid       undefined
account      sge
priority     0
qsub_time    Wed Mar 21 12:35:43 2012
start_time   Wed Mar 21 12:35:47 2012
end_time     Wed Mar 21 12:45:45 2012
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 598
ru_utime     115.199
ru_stime     482.510
ru_maxrss    427756
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    27853
ru_majflt    44
ru_nswap     0
ru_inblock   2904
ru_oublock   2136
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     23559
ru_nivcsw    6022
cpu          598.520
mem          17179871429.678
io           0.167
iow          0.000
maxvmem      3.929G
arid         undefined


How to use the high-speed storage area (Lustre area)

Lustre configuration

In this supercomputer system, the user home directories are built on the Lustre File System, a parallel file system mainly used in large-scale supercomputer systems. Two MDS units (described later), 12 OSS units (described later), and 72 OSTs (described later) constitute one Lustre File System, and two such file systems have been introduced into this system (as of March 2012; to be expanded in fiscal year 2014).

[Figure: Lustre File System configuration]

Lustre components

Simply put, a Lustre File System comprises an IO server called the Object Storage Server (OSS), a disk device called the Object Storage Target (OST), and a server called the MetaData Server (MDS), which manages the file metadata. Each component is described in the table below:

Component | Description
Object Storage Server (OSS) | Manages OSTs (described below) and controls IO requests arriving over the network to the OSTs.
Object Storage Target (OST) | A block storage device that stores file data. One OST is treated as one virtual disk and comprises multiple physical disks. User data is stored on one or more OSTs as one or more objects. The number of objects per file can be changed, and storage performance can be adjusted by tuning it. In this system, eight OSTs are managed by each OSS.
Meta Data Server (MDS) | One server per Lustre file system (two units in an HA configuration in this system). It manages the location data of the objects to which files are assigned and the file attribute data within the file system, and directs file IO requests to the proper OSS. Once a file IO has been initiated, the MDS is no longer involved, and data is transferred directly between the client and the OSS; this is why the Lustre File System can deliver high IO performance.
Meta Data Target (MDT) | The storage device that stores the metadata (file names, directories, ACLs, etc.) of the files in the Lustre File System. It is attached to the MDS.

File striping

One of the characteristics of Lustre is its ability to divide one file into multiple segments and store them distributed over multiple OSTs. This function is called file striping. The advantage of file striping is that, because one file is stored on multiple OSTs as multiple segments, the client can read and write the segments in parallel and thus handle large files at high speed. There are also disadvantages: the overhead of handling the dispersed data increases as a file is spread over more OSTs, so tuning the stripe size and stripe count is generally considered effective only when the target file is several GB or larger. In addition, because Lustre manages metadata centrally on the MDS, file operations that involve heavy metadata activity (ls -l, creating many small files, etc.) concentrate load on the MDS and are not very fast compared with the equivalent operations on a local file system. Please keep this in mind and avoid operations such as placing tens of thousands of small files in the same directory (it is better to spread them over multiple directories).

Checking the usage status of the home directory

The usage status of your current home directory can be checked using the "lfs quota" command.

 [username@t260 ~]$ lfs quota -u username ./
Disk quotas for user username (uid 3055):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             ./ 792362704       0 1000000000       -   10935       0       0       -
Disk quotas for group tt-gsfs (gid 10006):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             ./ 8250574892       0       0       - 5684904       0       0       -

The meanings of the items are as follows:

Item | Meaning/description
kbytes | File capacity in use (KB)
quota | Soft limit on file capacity/number of files
limit | Hard limit on file capacity/number of files
grace | Grace period during which the limit may be exceeded
files | Number of files in use


How to set up file striping

To set up file striping, follow the procedure below. First, check the current stripe count using "lfs getstripe <target file (directory)>". (The system default is set to 1.)

Striping is then configured with the "lfs setstripe" command; its main options are as follows:

Option | Description
-c | Sets the stripe count (the number of OSTs over which a file is striped).
-o | Specifies the offset (the index of the first OST used).
-s | Specifies the stripe size.
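
A minimal sketch of both steps, assuming a hypothetical directory mydir under the home directory (files created in the directory afterwards inherit its stripe settings):

    lfs getstripe ./mydir              # check the current stripe settings
    lfs setstripe -c 4 -s 2M ./mydir   # stripe new files over 4 OSTs with a 2 MB stripe size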


User data backup

A directory called backup is prepared in each user's home directory by default, and the system performs a differential backup of the files placed in it once a day. Place any data that you wish to back up under <home directory>/backup.
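
For example, to include a (hypothetical) data directory in the daily differential backup:

    cp -a ~/important_data ~/backup/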

The procedure to restore the backup data is as follows:

First, check the backup acquisition status with the following command:

 [username@t261 ~]$ rdiff-backup -l /backup/username/
Found 0 increments:
Current mirror: Tue Jun 26 08:16:57 2012

Look at the date in the "Current mirror" line above and specify the restore point using the W3C date format (YYYY-MM-DDTHH:MM:SS). In this example, the files are restored into the "./restore" directory:

[username@t261 ~]$ rdiff-backup --restore-as-of 2012-06-26T08:16:57 /backup/username/ ./restore

How to use X client on the login node

To use an X Window System client on a login node, prepare a corresponding X server emulator if you are using Windows. On a Mac, the X11 SDK is included in the Xcode Tools; install it so that X11 can be used.
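
A minimal sketch of the connection, assuming X11 forwarding is permitted on the gateway and carried through to the qlogin session:

    ssh -X username@gw.ddbj.nig.ac.jp   # -X enables X11 forwarding
    qlogin                              # then log in to a login node as usual

GUI applications started on the login node should then be displayed by the X server on your local machine.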