HPC - User Guide¶
This document is the user guide for the HPC cluster.
Contents¶
- Cluster Frontend
  - Role of the Frontend
  - Home and Scratch directories
  - Frontend Access
  - On Account Creation
  - Environment Modules
- Slurm - The Workload Manager
  - A Quick Tour
  - Job Preparation/Submission/Monitoring
Users Rules of the Road¶
By obtaining access to the cluster, the account holder agrees
- Not to share their login credential with others;
- To exercise the efficient and judicious use of the limited computing resources shared with other users;
- To free up the shared resources (such as disk space) in a timely manner upon completion of jobs;
- Not to keep any security-sensitive, personally identifiable information in the cluster; and
- Not to circumvent the rules of the cluster use in place.
The account holder understands that violation of any of the above can result in immediate termination of cluster access without advance notice or prior consent.
Cluster Configuration¶
The block diagram of the cluster is found here. Further detailed cluster hardware specs are here.
Cluster Frontend¶
All users interact with the cluster only through the frontend. Therefore, it is important to understand the role of the frontend.
Role of the Frontend¶
The frontend is not where the actual computation occurs (in most cases). Rather, it is where users compile their source code; prepare and submit job scripts (instructions specifying the computing resources needed, how to run the program, where to put the computation results, etc.); monitor and manage running/queued jobs; and retrieve the computation results to their local machines.
A typical workflow is as follows.
- Users upload to the frontend the source code and, if any, additional data needed for computation.
- Users then compile the code and prepare a job script to be submitted to the workload manager.
- After submitting the job, it is up to the workload manager to decide whether to accept or reject the job, and if accepted, when and on which machine(s) to execute the job. Various factors go into making the decision, such as the job size, association of the user, etc.
- While the workload manager is doing its magic, users can monitor the status of their pending/running job(s). Users can cancel any pending/running jobs at any point.
- When the job is completed, users either download the results to a local machine or resubmit the job for subsequent computation.
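The workflow above can be sketched as a minimal batch job script. This is an illustrative sketch, not a cluster-provided template: the program name `my_program`, the job name, and the resource numbers are placeholders to adapt to your own work.

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in the queue
#SBATCH --partition=compute      # partition (job queue) to submit to
#SBATCH --nodes=1                # number of nodes requested
#SBATCH --ntasks=4               # number of tasks (e.g., MPI ranks)
#SBATCH --time=01:00:00          # walltime limit (hh:mm:ss)
#SBATCH --output=%x-%j.out       # stdout/stderr file (%x = job name, %j = job id)

# launch the (hypothetical) program on the allocated resources
srun ./my_program
```

Such a script would be submitted with `sbatch job.sh`, monitored with `squeue --me`, and cancelled with `scancel <jobid>`.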
Home and Scratch directories¶
For work space, all users will be assigned a home directory under the /home partition.
In addition, some users will also be assigned a scratch directory under the /scratch partition.
The storage space in user's home directory is limited to about 5GB (quota), but there is no age limit on files and folders stored in it. On the other hand, the scratch directory provides a bigger quota (~ 1TB), but files and folders older than certain ages (TODO: Add link) are periodically purged.
The intended use of these directories is:
- The home directory is where users keep their shell environment settings, dependent libraries needed for linking and running their programs, any programs not provided by the cluster, etc.
- The scratch directory is where user programs dump intermediate computation results that are expected to be fetched at a later time.
It is important to understand that the quota caps in the home and scratch partitions do not mean that the disk space described by the quotas is reserved. If the cluster is over-subscribed and all users demand their full quotas, those partitions will run out of space before the quota caps are reached. Therefore, to be a good citizen of the cluster, free up disk space for others by deleting unnecessary files/folders.
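One way to stay within this etiquette is a periodic `find`-based cleanup. The sketch below runs on a throwaway directory standing in for your scratch directory, and the 30-day threshold is an assumption for illustration; consult the actual purge policy before deleting anything.

```shell
# Demo on a throwaway directory, standing in for /scratch/$USER
demo="$(mktemp -d)"
touch "$demo/new.dat"
touch -d '40 days ago' "$demo/old.dat"   # backdate the mtime (GNU touch)

# Dry run first: list files untouched for more than 30 days
find "$demo" -type f -mtime +30 -print

# Delete them once the listing looks right
find "$demo" -type f -mtime +30 -delete
```

Running the `-print` form first and only then the `-delete` form avoids surprises from a mistyped path or threshold.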
Frontend Access¶
The frontend can be accessed at
hpc.kyungguk.com
via an SSH client (port: 22).
The SSH access is authenticated in either of two ways:
- Two-factor authentication (2FA): user account password and a one-time password generated by your OTP-generator app. For the OTP-generator apps, options are:
- Public/private key authentication: See password-less SSH.
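A minimal sketch of setting up public/private key authentication, run on your local machine. The key type, file name, comment, and the user name `user` are illustrative; see the password-less SSH page for the cluster's own instructions.

```shell
# Generate a key pair locally (a passphrase is recommended)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -C "laptop -> hpc"

# Install the public key on the frontend (authenticates with 2FA once)
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@hpc.kyungguk.com

# Subsequent logins use the key instead of the password + OTP
ssh user@hpc.kyungguk.com
```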
The transfer of files to and from the cluster can be done with any client that supports the SSH protocol, such as
- scp -- OpenSSH secure file copy
- sftp -- OpenSSH secure file transfer
- rsync -- a fast, versatile, remote (and local) file-copying tool
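Typical invocations look like the following; the file and directory names (and the user name `user`) are placeholders for your own.

```shell
# Copy a local file to your home directory on the cluster
scp input.dat user@hpc.kyungguk.com:~/

# Fetch a results directory from scratch to the local machine
scp -r user@hpc.kyungguk.com:/scratch/user/results ./

# rsync transfers only what changed; convenient for repeated syncs
rsync -avz user@hpc.kyungguk.com:/scratch/user/results/ ./results/
```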
On Account Creation¶
Upon approval, users will receive a temporary password for their account and a QR code to be scanned by an OTP-generator app. Follow the instructions below for a first-time login.
- Use the QR code to register the authentication tokens (encoded in the QR code) to your OTP-generator app.
- Use the frontend access information and your credentials to log in to the frontend.
- Reset your user password and optionally configure password-less SSH.
- Confirm that you have the home and scratch directories set up
  [user@node0 ~]$ ls -d /home/$USER /scratch/$USER
  /home/user /scratch/user
- Confirm your disk quota (see the quota limits here)
  [user@node0 ~]$ xfs_quota -c "quota -bih $UID" /home
  Disk quotas for User user (2000)
  Filesystem                        Blocks  Quota  Limit  Warn/Time      Files  Quota  Limit  Warn/Time    Mounted on
  /dev/mapper/centos_node0-home      75.8M     5G     7G  00 [------]     1.4k      0      0  00 [------]  /home
  [user@node0 ~]$ xfs_quota -c "quota -bih $UID" /scratch
  Disk quotas for User user (2000)
  Filesystem                        Blocks  Quota  Limit  Warn/Time      Files  Quota  Limit  Warn/Time    Mounted on
  /dev/mapper/centos_node0-scratch   54.2M     1T   1.2T  00 [------]      877      0      0  00 [------]  /scratch
- Confirm your frontend resource limits are set properly (see the resource limits here)
  [user@node0 ~]$ systemctl -t slice show user-$UID.slice
  ...
  CPUQuotaPerSecUSec=2s
  ...
  MemoryLimit=4294967296
  ...
  TasksMax=100
  ...
Environment Modules¶
The Environment Modules package is a tool that simplifies shell initialization for, e.g., source code compilation.
For example, some users may use an older version of GCC with some version of MPI, while other users may use a newer version of GCC and a different version of MPI.
Modulefiles for the individual software contain the information needed to configure the shell for an application, e.g., setting shell environment variables such as PATH, MANPATH, LD_LIBRARY_PATH, etc.
This way, different versions of the same software, or different but conflicting applications, can co-exist without messing up the shell environment.
To list currently loaded modulefiles,
[user@node0 ~]$ module list
No Modulefiles Currently Loaded.
To list available modulefiles,
[user@node0 ~]$ module avail
-------------------------------------------- /usr/share/Modules/modulefiles --------------------------------------------
dot module-git module-info modules null use.own
------------------------------------------------- /etc/modulefiles/mpi -------------------------------------------------
mvapich2/2.3.4(default) openmpi/4.0.5(default)
---------------------------------------------- /etc/modulefiles/compiler -----------------------------------------------
gcc/10.2.0 go/go1.15.2(default) llvm/5.0.2
gcc/8.4.0(default) llvm/10.0.1 llvm/9.0.1(default)
------------------------------------------------ /etc/modulefiles/misc -------------------------------------------------
opa2slurm/38c3fda9390e1084b53d5b303fcf93ba6f0eaeae(default)
singularity/3.6.3(default)
------------------------------------------------ /etc/modulefiles/libs -------------------------------------------------
jsoncpp/1.9.3-gcc jsoncpp/1.9.3-llvm openpa/1.0.4
To print a short description of a specific modulefile,
[user@node0 ~]$ module whatis mvapich2
mvapich2 : loads the MVAPICH2 environment
To see what the modulefile is going to do when loaded
[user@node0 ~]$ module show mvapich2
-------------------------------------------------------------------
/etc/modulefiles/mpi/mvapich2/2.3.4:
module-whatis loads the MVAPICH2 environment
setenv MV2_ENABLE_AFFINITY 0
prepend-path PATH /opt/mvapich2/2.3.4/bin
prepend-path MANPATH /opt/mvapich2/2.3.4/share/man
prepend-path LD_LIBRARY_PATH /opt/mvapich2/2.3.4/lib
-------------------------------------------------------------------
To load a modulefile,
[user@node0 ~]$ module load mvapich2
[user@node0 ~]$ module list
Currently Loaded Modulefiles:
1) mvapich2/2.3.4
To see if the shell environment variables are set,
[user@node0 ~]$ echo $LD_LIBRARY_PATH
/opt/mvapich2/2.3.4/lib
To unload a modulefile,
[user@node0 ~]$ module unload mvapich2
[user@node0 ~]$ module list
No Modulefiles Currently Loaded.
NOTE FOR FORTRAN USERS
Only the gfortran compiler is available.
Furthermore, the available MPI programs are compiled with the system-default compiler, which is gcc (GCC) 4.8.5.
So, it is recommended to use the system-default compiler (i.e., do not load any other compilers), especially if you are using Fortran's MPI module in your code.
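A typical compile step on the frontend might then look like the following sketch. The source file names are illustrative; the `mpicc`/`mpifort` wrappers are assumed to be provided by the loaded MPI module.

```shell
# Load an MPI stack (puts the compiler wrappers and libraries on PATH)
module load mvapich2

# C code: compile with the MPI wrapper around the system gcc
mpicc -O2 -o hello_mpi hello_mpi.c

# Fortran code: per the note above, stick with the system-default gfortran
mpifort -O2 -o hello_mpi hello_mpi.f90
```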
Slurm - The Workload Manager¶
A Quick Tour¶
Slurm is a workload manager used in parallel computing.
To see the available partitions (groups of computing resources; analogous to job queues)
[user@node0 ~]$ sinfo -al
Sat Sep 19 09:11:49 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
compute* up infinite 1-infinite no NO all 2 idle node[1-2]
frontend up infinite 1-infinite no NO all 1 idle node0
There are two partitions in service.
The compute partition is made of two nodes, node[1-2], while the frontend partition is made of one node, the frontend.
The fact that some limits are infinite (e.g., TIMELIMIT) does not mean that users have no limit in acquiring those resources.
As will be shown later, Slurm allows a fine-grained control of how limits are set per job.
To see more details about the partition information
[user@node0 ~]$ scontrol show part
PartitionName=compute
AllowGroups=ALL AllowAccounts=ALL DenyQos=compile
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=224 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=1715 MaxMemPerNode=192091
PartitionName=frontend
AllowGroups=ALL AllowAccounts=ALL AllowQos=compile,debug
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=56
Nodes=node0
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=3001 MaxMemPerNode=168085
Some noteworthy information is:
- The compute partition has a total of 224 CPUs (or threads if hyperthreading is enabled, which is the case in this cluster) and 192091 MB RAM per node, and accepts all but the compile QOS (Quality of Service).
- The frontend partition has a total of 64 CPUs, but only 56 of them (MaxCPUsPerNode) are usable for jobs. It only accepts the compile and debug QOSs.
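For example, a compile or short test can be directed to the frontend partition with an appropriate QOS, while production jobs go to compute. The exact options below are a sketch; adjust them to your own association and QOS access.

```shell
# Short interactive session on the frontend partition under the compile QOS
srun --partition=frontend --qos=compile --ntasks=1 --pty bash

# Batch job on the compute partition (the default partition)
sbatch --partition=compute my_job.sh
```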
The following command displays detailed information of individual nodes
[user@node0 ~]$ scontrol show node=node2
NodeName=node2 Arch=x86_64 CoresPerSocket=28
CPUAlloc=0 CPUTot=112 CPULoad=0.01
AvailableFeatures=2.2GHz
ActiveFeatures=2.2GHz
Gres=(null)
NodeAddr=node2 NodeHostName=node2 Version=20.02.4
OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
RealMemory=192091 AllocMem=0 FreeMem=186447 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2020-09-14T13:04:17 SlurmdStartTime=2020-09-17T08:53:56
CfgTRES=cpu=112,mem=192091M,billing=112
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
It is important to understand the difference among CPU, thread, and core in Slurm.
A core is a physical processing unit -- typically, what the rest of the world refers to as a "CPU" chip is a multi-core processor containing many cores.
A thread is a logical processing unit by which instructions in a computer program are executed.
A core typically has one thread, but if it is hyper-threaded, the OS recognizes it as having two threads.
In the latter case, two programs can therefore be executed simultaneously (from the user's perspective) by one core, although the performance gain won't be 2x.
In Slurm, a CPU is synonymous with a thread.
For example, node1 and node2 have 112 CPUs, but since there are two threads per core, the total number of physical cores is 56.
In the current configuration, CPU allocation is at the socket level. Hyperthreading is enabled, but a task is allocated to a core.
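Under these settings, the CPU count a job is billed for can differ from the task count. The numbers below are illustrative, and `--hint=nomultithread` is a standard Slurm option rather than a site-specific one.

```shell
# Request 4 tasks; with one task per core and two hardware threads
# per core, Slurm accounts 2 "CPUs" (threads) for each task
sbatch --ntasks=4 --partition=compute my_job.sh

# Ask for one thread per core explicitly if your code does not
# benefit from hyperthreading
sbatch --ntasks=4 --hint=nomultithread my_job.sh
```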
To see all of your running/queued jobs
[user@node0 ~]$ squeue -al --me
Sat Sep 19 09:32:14 2020
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
178 compute sleep user RUNNING 0:04 2-00:00:00 1 node1
There is one job named sleep whose JOBID is 178 and which has been RUNNING for 4 sec on the compute partition.
It requested one node, for which node1 has been assigned, with a TIME limit of 2 days.
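Related monitoring and cancellation commands (job ID 178 from the listing above is reused for illustration):

```shell
# Follow the queue, refreshing the listing every 10 seconds
squeue -al --me --iterate=10

# Cancel a specific job ...
scancel 178

# ... or all of your jobs on a given partition
scancel -u $USER --partition=compute
```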
To see more details about the job status,
[user@node0 ~]$ scontrol show job=178
JobId=178 JobName=sleep
UserId=user(2000) GroupId=users(100) MCS_label=N/A
Priority=61697 Nice=0 Account=research QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:30 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2020-09-19T09:32:10 EligibleTime=2020-09-19T09:32:10
AccrueTime=Unknown
StartTime=2020-09-19T09:32:10 EndTime=2020-09-19T09:32:40 Deadline=N/A
PreemptEligibleTime=2020-09-19T15:32:10 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-19T09:32:10
Partition=compute AllocNode:Sid=123.213.22.160:178466
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node1
BatchHost=node1
NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=3430M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1715M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/scratch/user/test
Power=
MailUser=(null) MailType=NONE
This job is associated with the research (bank) account and the normal QOS, and by the time this command was issued, the job had already been COMPLETED.
The working directory of this job was /scratch/user/test.
As alluded earlier, Slurm provides a fine-grained control of how resource limits are set. The resource limits are set by job's Association which is formed by a User Account, a Partition to which users submit their job, a Bank Account to which users charge the usage, and a Quality of Service (QOS) that the job requires. The individual entities -- user account, partition, bank account, and QOS -- can have their own limits. The association determines the maximum limits that a job can request. A job can, and should, request resources smaller than, or equal to, the limits set by job's association; it cannot request resources exceeding those limits.
For example, the following command lists the associations of user
[user@node0 ~]$ sacctmgr show user withass where name=$USER acc=research
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user research None hpc.kyung+ research frontend parent 1 2-00:00:00 debug
user research None hpc.kyung+ research compute 20 3 2-00:00:00 debug,normal normal
This user has two associations, both of which are linked to the research account.
The user can request resources from the frontend and compute partitions.
The resource limits are different depending on the associations.
- For the association with the frontend partition, the user has a fair share inherited from the account, can submit at most one job, can request no more than 2 days of walltime, and can only use the debug QOS.
- For the association with the compute partition, the user has a fair share of 20, can submit at most three jobs, can request no more than 2 days of walltime, and can use the debug and normal QOSs.
The Def Acct and Def QOS fields indicate the values that will be used if none are explicitly specified on job submission.
QOSs can impose their own limits, which take precedence over the association limits.
The following command lists information about the debug and normal QOSs.
[user@node0 ~]$ sacctmgr show qos where names=debug,normal
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
debug 1 00:00:00 cluster DenyOnLimit 1.000000 cpu=16,mem=48+ 02:00:00 1
normal 0 00:00:00 cluster DenyOnLimit 1.000000
Most of the fields of the normal QOS are unspecified, in which case the limits on the association, if any, are imposed.
When jobs under the normal QOS are subject to preemption, jobs that can complete within 6 hours are exempt from preemption.
The default preemption mode is CANCEL AND REQUEUE.
Job preemption, if at all, will be granted only sparingly.
The debug QOS is, as the name implies, for use with code debugging and testing.
It has a slightly higher priority and jobs under this QOS will not be preempted.
To prevent abuse of this QOS, only one job per user is allowed under it, with a maximum CPU count per node of 8 and a maximum walltime of 2 hours.
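A quick interactive debugging session within those limits might be requested as follows; the values here sit at the QOS caps and should be reduced where possible.

```shell
# One interactive job, up to 8 CPUs and 2 hours, under the debug QOS
srun --qos=debug --ntasks=1 --cpus-per-task=8 --time=02:00:00 --pty bash
```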
The most likely reason for your job sitting in the queue for a long time is low scheduling priority.
All pending jobs in the queue are scheduled to run according to their priority.
One can use the sprio command to query all pending jobs' priority.
A job's priority is determined by various factors.
Among them is the FairShare.
One can query this information using sshare.
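For example, with standard sprio/sshare flags:

```shell
# Priority of all pending jobs, broken down by factor (age, fairshare, ...)
sprio -l

# Your usage record and the resulting FairShare factor
sshare -l -U
```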