December 12, 2017

Basic procedures for using the Supercomputer System

 

This page explains the basic usage of this supercomputer system. Although the contents partly overlap, the materials distributed at the briefing sessions (workshops) are also posted below, so please refer to them together.

Document Title | Japanese edition | English edition | Outline explanation
Explanatory meeting for users of the new supercomputer system (Introduction to the system) | from here | from here | Outline description of the system
Explanatory meeting for users of the supercomputer system -- Basic usage -- | from here | from here | Explains the outline of the system configuration and how to use the commands necessary for basic use of the system
Explanatory meeting for users of the supercomputer system -- Overview of UGE -- | from here | from here | Explains the basic usage of Univa Grid Engine necessary for system use
Explanatory meeting for users of the supercomputer system -- Know-how for entering jobs in UGE -- | from here | from here | Explains the basic usage of Univa Grid Engine necessary for system use
Explanatory meeting for users of the supercomputer system -- User registration -- | from here | from here |

 

Our supercomputer system comprises the hardware described in the hardware configuration section. This hardware environment uses the Univa Grid Engine (UGE) job management system so that multiple users can share it efficiently. The environment comprises the following components:

Gateway node (gw.ddbj.nig.ac.jp)

This is the node used to connect to the system from the internet. To utilize the system, a user must first connect to this node from outside the system.

Login node

This is the node on which users develop programs or enter jobs into the job management system. Users log in to this node from the gateway node using the qlogin command. There are multiple login nodes within the system, and each user is assigned to the login node with the lowest load when they log in.

Compute node

This is the node on which jobs entered by users and managed by the job management system are executed. There are Thin compute nodes, Medium compute nodes, and Fat compute nodes, depending on the type of hardware.

Queue

A queue is a concept in UGE in which compute nodes are grouped logically. When UGE is instructed to execute a job using a queue, computation is carried out automatically on a compute node so that the specified conditions are satisfied. There are multiple types of queues; these will be described later. The specific procedures for using these queues are provided in a subsequent section.

A schematic drawing of system utilization that shows how these components are connected is provided below:
[Figure: sys01_e1.png — schematic drawing of system utilization]
 
The basic steps in the procedure for using the system are as follows:

Connecting to the system

First, connect to the gateway node (gw.ddbj.nig.ac.jp) via ssh. Please prepare SSH client software that can be used on your local terminal and establish a connection; an example is shown below the gateway list. The authentication method is password authentication.

 ・Phase1 system gateway ⇒ gw.ddbj.nig.ac.jp

 ・Phase2 system gateway ⇒ gw2.ddbj.nig.ac.jp
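
For example, connecting to the Phase1 gateway with a standard OpenSSH client might look like the following (replace username with your own account name; the prompts are illustrative):

 [user@localhost ~]$ ssh username@gw.ddbj.nig.ac.jp
 username@gw.ddbj.nig.ac.jp's password:
 [username@gw ~]$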

Please note that the gateway node is the entrance to the system from outside, and therefore no computations or programs can be executed on this node, nor are there any suitable environment settings. Once you are connected to the gateway node, log in to a login node by following the procedure below:

Logging in to a login node

Use qlogin, a UGE command, as follows:

 [username@gw ~]$ qlogin
Your job 6896426 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 6896426 has been successfully scheduled.
Establishing /home/geadmin/UGER/utilbin/lx-amd64/qlogin_wrapper session to host t266i

When executed, qlogin checks the load conditions of the login nodes, automatically selects the node with the lowest load, and logs in to it. (In the above example, the login session itself is recognized as an interactive job (job ID 6896426), and the login node t266 is selected for the login.) On executing the qstat command (described later) on the logged-in node, information about the current jobs under UGE management in the queue can be viewed as follows:

 [username@gw ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue          slots ja-task-ID
---------------------------------------------------------------------------------------------------
6896426 0.00000 QLOGIN     username     r     06/14/2012 13:27:53 login.q@t261i      1

Various development environments and scripting language environments have been preinstalled on the login nodes. To conduct development activities, please work on the login nodes. For information about the programming environments on the login nodes, please refer to programming environments. Do not execute large-scale computations on the login nodes. Please be sure to execute computations on the compute node by entering the computation job via UGE.

Login nodes equipped with GPGPUs are provided for program development using the GPGPU development environment. To log in to such a node, specify gpu with the -l option when executing qlogin on the gateway node, as follows:

  qlogin -l gpu

This allows you to log in to a login node equipped with GPGPU. (Common to Phase1 and Phase2)

 

A login node with the Xeon Phi coprocessor installed is also provided for developing programs that use the Xeon Phi coprocessor development environment. To log in to this node, specify phi with the -l option when executing qlogin on the gateway node:

    qlogin -l phi    

This allows you to log in to the login node equipped with the Xeon Phi coprocessor. (Phase2 only)

The procedure and relevant steps to enter jobs on the compute nodes are as follows:

Entering computation jobs

To enter computation jobs on the compute nodes, the qsub command should be used. In order to enter jobs using the qsub command, a job script needs to be prepared and used. A simple descriptive example of such a script is shown below:

#!/bin/sh
#$ -S /bin/sh
pwd
hostname
date
sleep 20
date
echo "to stderr" 1>&2
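
For example, assuming the script above has been saved as test.sh on a login node, it can be entered as follows (the job ID in the output is illustrative):

 [username@t266 ~]$ qsub test.sh
 Your job 6896500 ("test.sh") has been submitted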

The lines beginning with "#$" at the top of the script set options for UGE. The mode of operation is communicated to UGE either by writing these option instruction lines in the shell script or by specifying the same options when executing the qsub command. The major options are as follows:

Description of instruction line | Command line option | Meaning of instruction
#$ -S <interpreter path> | -S <interpreter path> | Specifies the path of the command interpreter. Scripting languages other than a shell can also be specified. This option does not have to be specified.
#$ -cwd | -cwd | Executes the job in the current working directory and outputs the job's standard output and standard error output there. If this is not specified, the home directory is used as the working directory for job execution.
#$ -N <job name> | -N <job name> | Specifies the name of the job. The script name is used as the job name if this is not specified.
#$ -o <file path> | -o <file path> | Specifies the output destination for the job's standard output.
#$ -e <file path> | -e <file path> | Specifies the output destination for the job's standard error output.


Many other options in addition to the ones listed above can be specified for qsub. For details, please run "man qsub" after logging in to the system and check the online manual for the other options.
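
For example, the directives in the script above can equally be given on the qsub command line; the following illustrative command enters test.sh with the job executed in the current directory, named myjob, and with its standard output and standard error output written to the specified files (the job name and file names are placeholders):

    qsub -cwd -N myjob -o myjob.out -e myjob.err test.sh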

Queue selection when entering jobs

As described in the software configuration section, this system has queues that are set up as follows (as of July 27, 2017). Queues can be selected using the -l option with the qsub command. To enter a job by specifying the queues, execute qsub by specifying one of the “queue specification options” in the table below. If nothing is specified, the job will be entered in month_hdd.q.
We added a short-time job queue (short.q) in December 2014.

Phase1 System

Queue name | Number of job slots | Maximum memory capacity | Upper limit for execution time | Purpose | Options for queue specification
month_hdd.q | 832 | 64G | 62 days | For jobs with no particular resource request and an execution period within two months | No specification or -l month (default)
month_gpu.q | 248 | 32G | 62 days | Use of GPU | -l gpu or -l month -l gpu
month_medium.q | 160 | 2T | 62 days | Use of Medium compute nodes | -l medium or -l month -l medium
month_fat.q | 768 | 10T | 62 days | Use of Fat compute nodes | -l fat or -l month -l fat
debug.q | 48 | 64G | 3 days | For debugging and operation checks | -l debug
login.q | 192 | 64G | Unlimited | Used to execute qlogin from the gateway node |
short.q | 744 | 32G | 3 days | For short-time jobs | -l short

  ※Please refer to "TIPS for entering jobs on the Fat compute nodes" when submitting jobs to month_fat.q.

Phase2 System

Compared with the Phase1 system, the month_phi.q queue, which supports highly parallel execution, has been added.

Queue name | Number of job slots | Maximum memory capacity | Upper limit for execution time | Purpose | Options for queue specification
month_hdd.q | 1020 | 64G | 62 days | For jobs with no particular resource request and an execution period within two months | No specification or -l month (default)
month_ssd.q | 640 | 64G | 62 days | Used when use of SSD is desired | -l ssd or -l month -l ssd
month_phi.q | 600 | 64G | 62 days | For jobs that utilize the coprocessor (Xeon Phi) | -l phi or -l month -l phi
month_gpu.q | 310 | 32G | 62 days | Use of GPU | -l gpu or -l month -l gpu
month_medium.q | 640 | 2T | 62 days | Use of Medium compute nodes | -l medium or -l month -l medium
debug.q | 80 | 64G | 3 days | For debugging and operation checks | -l debug (-l gpu / -l phi)※
login.q | 420 | 64G | Unlimited | Used to execute qlogin from the gateway node |
short.q | 930 | 32G | 3 days | For short-time jobs | -l short

 ※For a debug queue on a GPU-equipped node ⇒ -l debug -l gpu
  For a debug queue on a Xeon Phi-equipped node ⇒ -l debug -l phi

  ※Maximum memory capacity: the maximum value that can be specified with -l mem_req

For example, enter the command below to enter a job called test.sh in month_ssd.q.

    qsub -l ssd test.sh

Additionally, to enter a job called test.sh on the Medium compute node, enter the following command:

    qsub -l medium test.sh

To enter a job called test.sh on the Fat compute nodes, enter the following command. Please refer to "TIPS for entering jobs on the Fat compute nodes" when submitting jobs to month_fat.q.

    qsub -l fat test.sh
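
Similarly, following the queue specification options in the tables above, a short-time job or a debugging run on a GPU-equipped node could be entered as follows (illustrative examples):

    qsub -l short test.sh
    qsub -l debug -l gpu test.sh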

Note, however, that nodes equipped with GPGPU are also equipped with SSD, and nodes equipped with SSD are also equipped with HDD. As a result, the system is configured so that jobs flow to the GPU queue when the slots of the SSD queue are full, and jobs entered while the HDD queue is full are entered into the SSD queue. Please keep this in mind.

Checking the status of a job entered

Whether the job entered by qsub was actually entered as a job can be checked using the qstat command. The qstat command is used to check the status of entered jobs. If a number of jobs have been entered, for instance, qstat will give an output similar to the following:

qstat
job-ID  prior   name       user         state submit/start at     queue              slots ja-task-ID
-------------------------------------------------------------------------------------------------------
6929724 0.00000 jobname    username     r     06/18/2012 13:00:37 month_hdd.q@t274i      1
6929726 0.00000 jobname    username     r     06/18/2012 13:00:37 month_hdd.q@t274i      1
6929729 0.00000 jobname    username     r     06/18/2012 13:00:37 month_hdd.q@t287i      1
6929730 0.00000 jobname    username     r     06/18/2012 13:00:37 month_hdd.q@t250i      1

The meanings of the characters in the “state” column in this case are as follows:

Character | Meaning
r | Job being executed
qw | Job in the standby queue (waiting)
t | Job being transferred to the execution host
E | An error occurred while processing the job
d | Job in the process of deletion

To check the queue utilization status, input "qstat -f". This gives the following output:

[username@t266 pgi]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
debug.q@t139i                  BP    0/0/16         0.44     lx-amd64
---------------------------------------------------------------------------------
debug.q@t253i                  BP    0/0/16         0.49     lx-amd64
---------------------------------------------------------------------------------
debug.q@t254i                  BP    0/0/16         0.43     lx-amd64
---------------------------------------------------------------------------------
debug.q@t255i                  BP    0/0/16         0.44     lx-amd64
---------------------------------------------------------------------------------
debug.q@t256i                  BP    0/0/16         0.41     lx-amd64
(omitted)
---------------------------------------------------------------------------------
month_hdd.q@t267i              BP    0/16/16        16.65    lx-amd64      a
6901296 0.00000 job_name   username     r     06/17/2012 22:20:32    16
---------------------------------------------------------------------------------
month_hdd.q@t268i              BP    0/0/16         0.50     lx-amd64
---------------------------------------------------------------------------------
month_hdd.q@t269i              BP    0/0/16         0.57     lx-amd64
---------------------------------------------------------------------------------
(omitted)
month_hdd.q@t280i              BP    0/6/16         2.18     lx-amd64
---------------------------------------------------------------------------------
month_hdd.q@t281i              BP    0/12/16        12.79    lx-amd64
6901296 0.00000 job_name   username    r     06/17/2012 22:20:32    12
---------------------------------------------------------------------------------
month_hdd.q@t282i              BP    0/6/16         1.34     lx-amd64
---------------------------------------------------------------------------------
month_hdd.q@t283i              BP    0/16/16        16.63    lx-amd64      a
6901296 0.00000 job_name   username     r     06/17/2012 22:20:32    16
---------------------------------------------------------------------------------
month_hdd.q@t284i              BP    0/16/16        16.59    lx-amd64      a
6901296 0.00000 job_name   username     r     06/17/2012 22:20:32    16
---------------------------------------------------------------------------------
(omitted)

This output can be used to determine the node (queue) on which the job is entered. To view the overall state for each queue such as job entering status and queue load status, “qstat -g c” can be used.

[username@t266 pgi]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
debug.q                           0.03      0      0    128    128      0      0
login.q                           1.00    122      0     70    192      0      0
month_fat.q                       0.47    450      0    318    768      0      0
month_gpu.q                       0.03      0      0    976    992      0     16
month_hdd.q                       0.04      1      0     95     96      0      0
month_medium.q                    0.24     32      0    128    160      0      0
month_ssd.q                       0.03      0      0     32     32      0      0
web_month.q                       0.03      0      0     16     16      0      0
web_week.q                        0.03      0      0     16     16      0      0

Detailed information on a job can be obtained by specifying “qstat -j jobID.”

[username@t266 pgi]$ qstat -j 6901165
==============================================================
job_number:                 6901165
exec_file:                  job_scripts/6901165
submission_time:            Sun Jun 17 22:12:36 2012
owner:                      username
uid:                        XXXX
group:                      usergroup
gid:                        XXXX
sge_o_home:                 /home/username
sge_o_log_name:             username
sge_o_path:                 /home/geadmin/UGER/bin/lx-amd64:/usr/lib64/qt-3.3/bin:/
opt/pgi/linux86-64/current/bin:/usr/local/pkg/java/current/bin:/opt/intel/composer
_xe_2011_sp1.6.233/bin/intel64:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/
sbin:/sbin:/opt/bin:/opt/intel/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/opt/
intel/itac/8.0.3.007/bin:/home/username/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /lustre1/home/username/workdir
sge_o_host:                 t352i
account:                    sge
cwd:                        /home/username
hard resource_list:         mem_req=4G,s_stack=10240K,s_vmem=4G,ssd=TRUE
mail_list:                  username@t352i
notify:                     FALSE
job_name:                   job_name
jobshare:                   0
shell_list:                 NONE:/bin/sh
env_list:
script_file:                userscript
parallel environment:  mpi-fillup range: 128
binding:                    NONE
usage    1:                 cpu=83:14:55:50, mem=2934667.66727 GBs, io=88.79956,
vmem=57.457G, maxvmem=56.291G
binding    1:               NONE
scheduling info:            (Collecting of scheduler job information is turned off)

To delete a job immediately without waiting for its completion, for example when you check the execution status and find that the job status is incorrect, the qdel command can be used. Specify "qdel <job ID>". To delete all the jobs entered by a specific user, specify "qdel -u <username>".
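
For example, using the job ID and user name from the qstat output above (values are illustrative):

    qdel 6929724
    qdel -u username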

Checking the results

The results of a job are output to a file called <jobname>.o<job ID> (the job's standard output) and a file called <jobname>.e<job ID> (the job's standard error output). Please check these files. Detailed information, such as how many resources an executed job used, can be checked using the qacct command.
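
For example, for the script test.sh entered earlier, the result files could be checked as follows (the job ID is illustrative):

 [username@t266 ~]$ ls
 test.sh  test.sh.e6896500  test.sh.o6896500
 [username@t266 ~]$ cat test.sh.o6896500

The output of "qacct -j <job ID>" looks like the following: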

qacct -j 1996
==============================================================
qname        month_ssd.q
hostname     t046i
group        usergroup
owner        username
project      NONE
department   defaultdepartment
jobname      jobscript.sh
jobnumber    XXXX
taskid       undefined
account      sge
priority     0
qsub_time    Wed Mar 21 12:35:43 2012
start_time   Wed Mar 21 12:35:47 2012
end_time     Wed Mar 21 12:45:45 2012
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 598
ru_utime     115.199
ru_stime     482.510
ru_maxrss    427756
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    27853
ru_majflt    44
ru_nswap     0
ru_inblock   2904
ru_oublock   2136
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     23559
ru_nivcsw    6022
cpu          598.520
mem          17179871429.678
io           0.167
iow          0.000
maxvmem      3.929G
arid         undefined

 

How to use high-speed domain (Lustre domain)

Lustre configuration

In this supercomputer system, the user home directories are built on a file system comprising the Lustre File System. The Lustre File System is a parallel file system that is mainly used by large-scale supercomputer systems. Two MDS units (described later), 12 OSS units (described later), and 72 OSTs (described later) constitute one Lustre File System, and two of these file systems were introduced into this system (as of March 2012; expanded in fiscal year 2014).
In Phase2 (March 2014), three additional Lustre File Systems, each consisting of 2 MDSs, 18 OSSs, and 108 OSTs, were introduced, and the system now operates with a total of five file systems.

[Figure: Lustre01 — Lustre File System configuration]

Lustre components

Simply put, a Lustre File System comprises an IO server called the Object Storage Server (OSS), a disk device called the Object Storage Target (OST), and a server called the MetaData Server (MDS) that manages the file metadata as components. A description of each component is provided in the table below:


Component | Description of term
Object Storage Server (OSS) | Manages the OSTs (described below) and controls IO requests from the network to the OSTs.
Object Storage Target (OST) | A block storage device that stores the file data. One OST is treated as one virtual disk and comprises multiple physical disks. User data is stored in one or more OSTs as one or more objects. The number of objects per file can be changed, and the storage performance can be adjusted by tuning this setting. In this system configuration, eight OSTs are managed by one OSS.
Meta Data Server (MDS) | One server unit per Lustre file system (two units in an HA configuration in this system). It manages the position data of the objects to which files are assigned and the file attribute data within the file system, and guides file IO requests to the proper OSS. Once file IO has started, the MDS is no longer involved in the IO, and data is transferred directly between the client and the OSS; this is why the Lustre File System can achieve high IO performance.
Meta Data Target (MDT) | The storage device used to store the metadata (file names, directories, ACLs, etc.) of the files in the Lustre File System. It is connected to the MDS.

File striping

One of the characteristics of Lustre is its ability to divide one file into multiple segments and store them dispersed over multiple OSTs. This function is called file striping. The advantage of file striping is that, because one file is stored in multiple OSTs as multiple segments, the client can read and write the segments in parallel, so large files can be read and written at high speed. However, there are also disadvantages: the overhead for handling the dispersed data increases as the file is spread over multiple OSTs. In general, setting the stripe size and stripe count is considered effective only when the target file is several GB or larger. Also, since Lustre manages the metadata centrally on the MDS, file operations that involve heavy metadata access (such as ls -l or creating many small files) concentrate the load on the MDS, so they are not very fast compared with the equivalent operations on a local file system. Please keep this in mind and avoid, for example, placing tens of thousands of small files in the same directory (in such cases it is better to spread them over multiple directories).

Checking the usage status of the home directory

The usage status of the user’s current home directory can be checked using the “lfs quota” command.

 [munakata@t260 ~]$ lfs quota -u username ./
Disk quotas for user username (uid 3055):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             ./ 792362704       0 1000000000       -   10935       0       0       -
Disk quotas for group tt-gsfs (gid 10006):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             ./ 8250574892       0       0       - 5684904       0       0       -
Item | Meaning/description
kbytes | File capacity in use (KB)
quota | Limit on the file capacity/number of files (soft limit)
limit | Absolute limit on the file capacity/number of files (hard limit)
grace | Grace period during which the limit may be exceeded
files | Number of files in use

 

How to set up file striping

To set up file striping, please follow the procedure below. First, check the current stripe count. It can be checked with "lfs getstripe <target file (directory)>". (The system default is set to 1.)

[munakata@t261 ~]$ ls -ld tmp
drwxr-xr-x 8 username tt-gsfs 4096  6月 19 01:11 2012 tmp
[munakata@t261 ~]$ lfs getstripe tmp
tmp
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
tmp/SRA010766-assembly-out22

Stripe settings can be set with the "lfs setstripe" command.

Option name | Description
-c | Specifies the stripe count (the number of OSTs over which a file is striped).
-o | Specifies the stripe offset (the OST index at which striping starts).
-s | Specifies the stripe size.
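
For example, the following illustrative commands set a stripe count of 4 and a stripe size of 4 MB (4194304 bytes) on a directory named tmp (the directory name is a placeholder; files created under it inherit the setting) and then check the result:

    lfs setstripe -c 4 -s 4194304 tmp
    lfs getstripe tmp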

 

User data backup

A directory called ./backup is prepared in the user's home directory by default, and a differential backup of the files placed in this directory is executed once a day by the system. Place any data that you wish to back up under <home directory>/backup.
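
For example, to have a data directory included in the backup (the directory name is a placeholder):

    cp -r ~/important_data ~/backup/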

The procedure to restore the backup data is as follows:

First, check the backup acquisition status by entering the following command:

 [username@t261 ~]$ rdiff-backup -l /backup/username/
Found 0 increments:
Current mirror: Tue Jun 26 08:16:57 2012

Look at the date shown for "Current mirror" above, and specify the restore point using the W3C date format (YYYY-MM-DDTHH:MM:SS). In this example, the files are restored under a "./restore" directory.

[username@t261 ~]$ rdiff-backup --restore-as-of 2012-06-26T08:16:57 /backup/munakata/ ./restore

How to use X client on the login node

To use an X Window System client on the login node, please prepare a corresponding X server emulator if you are using Windows. For Mac, the X11 SDK is included in the Xcode Tools; install it so that X11 can be used.
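
For example, with an X server running on the local terminal, X11 forwarding can typically be requested when connecting to the gateway with an OpenSSH client (a minimal sketch; whether forwarding is carried through qlogin depends on the system settings):

    ssh -X username@gw.ddbj.nig.ac.jp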

 

How to use the Phase2 system computers (available from March 2014)

 If you have been using the Phase1 system, data migration from the Phase1 system is required in order to use the Phase2 system.

 For this data migration, dedicated data migration nodes connected to the InfiniBand network of the Phase1 system are provided as part of the Thin nodes of the Phase2 system.
 The data migration nodes can access the storage of both the Phase1 and Phase2 systems, so data can be moved between the Phase1 and Phase2 home areas.
 Please log in from the gateway of the Phase2 system and use these nodes.

 Below is the image diagram.

[Figure: network image — data migration nodes connecting the Phase1 and Phase2 systems]

 The Phase1 and Phase2 nodes that can refer to each home area are limited. Please check the table below for details.

Node referencing each home area | Phase1 home area path | Phase2 home area path
Phase1 Thin compute nodes, Medium compute nodes | /home/USER | Cannot refer
Phase2 Thin compute nodes, Medium compute nodes | Cannot refer | /home/USER
Phase2 GW, login nodes, data migration nodes | /home_ph1/USER | /home/USER
Fat compute nodes | /home/USER | /home_ph2/USER

 Data migration method

 After logging in to the Phase2 system gateway, you can log in to a data migration node by executing qlogin with the "-l trans" option.
 A command execution example is shown below.

 ①qlogin to the data migration node

[username@gw2 ~]$ qlogin -l trans
Your job 82 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 82 has been successfully scheduled.
Establishing /home/geadmin2/UGER/utilbin/lx-amd64/qlogin_wrapper session to host nt103i ...
username@nt103i's password:         ・・・In this case, nt103 becomes the data migration node.  
Last login: Wed Feb 19 16:27:01 2014 from nt091i
[username@nt103 ~]$                               

 ②Example of executing the cp command

[username@nt103]$ cp /home_ph1/username/(Source file name) /home/username/(Destination file name)    

 ③Example of rsync command execution

[username@nt103]$ rsync -av /home_ph1/username/(Source file / directory name) /home/username/(Destination file / directory name)    

 ※Data migration is also possible on the gateway (gw2) and login nodes (login.q) of the Phase2 system, but please cooperate by carrying out data migration on the data migration nodes as much as possible.

 Since the OS version differs between the Phase1 system and the Phase2 system, it may be necessary to recompile migrated programs. Please note this point.

The following are examples of entering jobs that specify a parallel environment and memory requests:

$ qsub -pe def_slot 4 -l s_vmem=8G -l mem_req=8G test.sh
$ qsub -pe mpi 2-12 -l s_vmem=4G -l mem_req=4G test.sh