HPC - User Guide¶
This document is the user guide for the HPC cluster.
Contents¶
- Cluster Frontend
  - Role of the Frontend
  - Home and Scratch directories
  - Frontend Access
  - On Account Creation
  - Environment Modules
- Slurm - The Workload Manager
  - A Quick Tour
  - Job Preparation/Submission/Monitoring
Users Rules of the Road¶
By obtaining access to the cluster, the account holder agrees
- Not to share their login credential with others;
- To exercise the efficient and judicious use of the limited computing resources shared with other users;
- To free up the shared resources (such as disk space) in a timely manner upon completion of jobs;
- Not to keep any security-sensitive, personally identifiable information in the cluster; and
- Not to circumvent the rules of the cluster use in place.
The account holder understands that violation of any of the above can result in immediate termination of cluster access without advance notice or prior consent.
Cluster Configuration¶
The block diagram of the cluster is found here. Further detailed cluster hardware specs are here.
Cluster Frontend¶
All users interact with the cluster only through the frontend. Therefore, it is important to understand the role of the frontend.
Role of the Frontend¶
The frontend is not where the actual computation occurs (in most cases). Rather, it is where users compile their source code; prepare and submit job scripts (instructions specifying the computing resources needed, how to run the program, where to put the computation results, etc.); monitor and manage running/queued jobs; and retrieve the computation results to their local machines.
A typical workflow is as follows.
- Users upload to the frontend the source code and, if any, additional data needed for computation.
- Users then compile the code and prepare a job script to be submitted to the workload manager.
- After submitting the job, it is up to the workload manager to decide whether to accept or reject the job, and if accepted, when and on which machine(s) to execute the job. Various factors go into making the decision, such as the job size, association of the user, etc.
- While the workload manager is doing its magic, users can monitor the status of their pending/running job(s). Users can cancel any pending/running jobs at any point.
- When the job is completed, users either download the results to a local machine or resubmit the job for subsequent computation.
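The workflow above can be sketched as a minimal batch job script. This is an illustrative sketch, not a cluster-provided template: the program name `my_program`, the job name, and the resource numbers are placeholders to adapt to your own work.

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in the queue
#SBATCH --partition=compute      # partition (job queue) to submit to
#SBATCH --nodes=1                # number of nodes requested
#SBATCH --ntasks=4               # number of tasks (e.g., MPI ranks)
#SBATCH --time=01:00:00          # walltime limit (hh:mm:ss)
#SBATCH --output=%x-%j.out       # stdout/stderr file (%x = job name, %j = job id)

# launch the (hypothetical) program on the allocated resources
srun ./my_program
```

Such a script would be submitted with `sbatch job.sh`, monitored with `squeue --me`, and cancelled with `scancel <jobid>`.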
Home and Scratch directories¶
For work space, all users will be assigned a home directory under the /home partition.
In addition, some users will also be assigned a scratch directory under the /scratch partition.
The storage space in user's home directory is limited to about 5GB (quota), but there is no age limit on files and folders stored in it. On the other hand, the scratch directory provides a bigger quota (~ 1TB), but files and folders older than certain ages (TODO: Add link) are periodically purged.
The intended use of these directories is:
- The home directory is where users keep their shell environment settings, dependent libraries needed for linking and running their programs, any programs not provided by the cluster, etc.
- The scratch directory is where user programs dump intermediate computation results that are expected to be fetched at a later time.
It is important to understand that the quota caps in the home and scratch partitions do not mean that the disk space described by the quotas is reserved. If the cluster is over-subscribed and all users demand their full quotas, those partitions will run out of space before the quota caps are reached. Therefore, to be a good citizen of the cluster, free up disk space for others by deleting unnecessary files/folders.
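One way to stay within this etiquette is a periodic `find`-based cleanup. The sketch below runs on a throwaway directory standing in for your scratch directory, and the 30-day threshold is an assumption for illustration; consult the actual purge policy before deleting anything.

```shell
# Demo on a throwaway directory, standing in for /scratch/$USER
demo="$(mktemp -d)"
touch "$demo/new.dat"
touch -d '40 days ago' "$demo/old.dat"   # backdate the mtime (GNU touch)

# Dry run first: list files untouched for more than 30 days
find "$demo" -type f -mtime +30 -print

# Delete them once the listing looks right
find "$demo" -type f -mtime +30 -delete
```

Running the `-print` form first and only then the `-delete` form avoids surprises from a mistyped path or threshold.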
Frontend Access¶
The frontend can be accessed at
hpc.kyungguk.com
via an SSH client (port: 22).
The SSH access is authenticated in either of two ways:
- Two-factor authentication (2FA): user account password and a one-time password generated by your OTP-generator app. For the OTP-generator apps, options are:
- Public/private key authentication: See password-less SSH.
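A minimal sketch of setting up public/private key authentication, run on your local machine. The key type, file name, comment, and the user name `user` are illustrative; see the password-less SSH page for the cluster's own instructions.

```shell
# Generate a key pair locally (a passphrase is recommended)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -C "laptop -> hpc"

# Install the public key on the frontend (authenticates with 2FA once)
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@hpc.kyungguk.com

# Subsequent logins use the key instead of the password + OTP
ssh user@hpc.kyungguk.com
```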
The transfer of files to and from the cluster can be done with any client that supports the SSH protocol, such as
- scp -- OpenSSH secure file copy
- sftp -- OpenSSH secure file transfer
- rsync -- a fast, versatile, remote (and local) file-copying tool
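Typical invocations look like the following; the file and directory names (and the user name `user`) are placeholders for your own.

```shell
# Copy a local file to your home directory on the cluster
scp input.dat user@hpc.kyungguk.com:~/

# Fetch a results directory from scratch to the local machine
scp -r user@hpc.kyungguk.com:/scratch/user/results ./

# rsync transfers only what changed; convenient for repeated syncs
rsync -avz user@hpc.kyungguk.com:/scratch/user/results/ ./results/
```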
On Account Creation¶
Upon approval, users will receive a temporary password for their account and a QR code to be scanned by an OTP-generator app. Follow the instructions below for a first-time login.
- Use the QR code to register the authentication tokens (encoded in the QR code) to your OTP-generator app.
- Use the frontend access information and your credentials to log in to the frontend.
- Reset your user password and optionally configure password-less SSH.
- Confirm that you have the home and scratch directories set up
  [user@node0 ~]$ ls -d /home/$USER /scratch/$USER
  /home/user /scratch/user
- Confirm your disk quota (see the quota limits here)
  [user@node0 ~]$ xfs_quota -c "quota -bih $UID" /home
  Disk quotas for User user (2000)
  Filesystem                        Blocks  Quota  Limit  Warn/Time      Files  Quota  Limit  Warn/Time    Mounted on
  /dev/mapper/centos_node0-home      75.8M     5G     7G  00 [------]     1.4k      0      0  00 [------]  /home
  [user@node0 ~]$ xfs_quota -c "quota -bih $UID" /scratch
  Disk quotas for User user (2000)
  Filesystem                        Blocks  Quota  Limit  Warn/Time      Files  Quota  Limit  Warn/Time    Mounted on
  /dev/mapper/centos_node0-scratch   54.2M     1T   1.2T  00 [------]      877      0      0  00 [------]  /scratch
- Confirm your frontend resource limits are set properly (see the resource limits here)
  [user@node0 ~]$ systemctl -t slice show user-$UID.slice
  ...
  CPUQuotaPerSecUSec=2s
  ...
  MemoryLimit=4294967296
  ...
  TasksMax=100
  ...
Environment Modules¶
The Environment Modules package is a tool that simplifies shell initialization for, e.g., source code compilation.
For example, some users may use an older version of GCC with some version of MPI, while other users may use a newer version of GCC and a different version of MPI.
Modulefiles for the individual software contain the information needed to configure the shell for an application, e.g., setting shell environment variables such as PATH, MANPATH, LD_LIBRARY_PATH, etc.
This way, different versions of the same software, or different but conflicting applications, can co-exist without messing up the shell environment.
To list currently loaded modulefiles,
[user@node0 ~]$ module list
No Modulefiles Currently Loaded.
To list available modulefiles,
[user@node0 ~]$ module avail
-------------------------------------------- /usr/share/Modules/modulefiles --------------------------------------------
dot module-git module-info modules null use.own
------------------------------------------------- /etc/modulefiles/mpi -------------------------------------------------
mvapich2/2.3.4(default) openmpi/4.0.5(default)
---------------------------------------------- /etc/modulefiles/compiler -----------------------------------------------
gcc/10.2.0 go/go1.15.2(default) llvm/5.0.2
gcc/8.4.0(default) llvm/10.0.1 llvm/9.0.1(default)
------------------------------------------------ /etc/modulefiles/misc -------------------------------------------------
opa2slurm/38c3fda9390e1084b53d5b303fcf93ba6f0eaeae(default)
singularity/3.6.3(default)
------------------------------------------------ /etc/modulefiles/libs -------------------------------------------------
jsoncpp/1.9.3-gcc jsoncpp/1.9.3-llvm openpa/1.0.4
To print a short description of a specific modulefile,
[user@node0 ~]$ module whatis mvapich2
mvapich2 : loads the MVAPICH2 environment
To see what the modulefile is going to do when loaded
[user@node0 ~]$ module show mvapich2
-------------------------------------------------------------------
/etc/modulefiles/mpi/mvapich2/2.3.4:
module-whatis loads the MVAPICH2 environment
setenv MV2_ENABLE_AFFINITY 0
prepend-path PATH /opt/mvapich2/2.3.4/bin
prepend-path MANPATH /opt/mvapich2/2.3.4/share/man
prepend-path LD_LIBRARY_PATH /opt/mvapich2/2.3.4/lib
-------------------------------------------------------------------
To load a modulefile,
[user@node0 ~]$ module load mvapich2
[user@node0 ~]$ module list
Currently Loaded Modulefiles:
1) mvapich2/2.3.4
To see if the shell environment variables are set,
[user@node0 ~]$ echo $LD_LIBRARY_PATH
/opt/mvapich2/2.3.4/lib
To unload a modulefile,
[user@node0 ~]$ module unload mvapich2
[user@node0 ~]$ module list
No Modulefiles Currently Loaded.
NOTE FOR FORTRAN USERS
Only the gfortran compiler is available.
Furthermore, the available MPI programs are compiled with the system-default compiler, which is gcc (GCC) 4.8.5.
So, it is recommended to use the system-default compiler (i.e., do not load any other compilers), especially if you are using Fortran's MPI module in your code.
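A typical compile step on the frontend might then look like the following sketch. The source file names are illustrative; the `mpicc`/`mpifort` wrappers are assumed to be provided by the loaded MPI module.

```shell
# Load an MPI stack (puts the compiler wrappers and libraries on PATH)
module load mvapich2

# C code: compile with the MPI wrapper around the system gcc
mpicc -O2 -o hello_mpi hello_mpi.c

# Fortran code: per the note above, stick with the system-default gfortran
mpifort -O2 -o hello_mpi hello_mpi.f90
```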
Slurm - The Workload Manager¶
A Quick Tour¶
Slurm is a workload manager used in parallel computing.
To see the available partitions (groups of computing resources; analogous to job queues)
[user@node0 ~]$ sinfo -al
Sat Sep 19 09:11:49 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
compute* up infinite 1-infinite no NO all 2 idle node[1-2]
frontend up infinite 1-infinite no NO all 1 idle node0
There are two partitions in service.
The compute partition is made of two nodes, node[1-2], while the frontend partition is made of one node, the frontend.
The fact that some limits are infinite (e.g., TIMELIMIT) does not mean that users have no limit in acquiring those resources.
As will be shown later, Slurm allows a fine-grained control of how limits are set per job.
To see more details about the partition information
[user@node0 ~]$ scontrol show part
PartitionName=compute
AllowGroups=ALL AllowAccounts=ALL DenyQos=compile
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=224 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=1715 MaxMemPerNode=192091
PartitionName=frontend
AllowGroups=ALL AllowAccounts=ALL AllowQos=compile,debug
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=56
Nodes=node0
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=3001 MaxMemPerNode=168085
Some noteworthy information is:
- The compute partition has a total of 224 CPUs (or threads if hyperthreading is enabled, which is the case in this cluster) and 192091 MB RAM per node, and accepts all but the compile QOS (Quality of Service).
- The frontend partition has a total of 64 CPUs, but only 56 of them (MaxCPUsPerNode) are usable for jobs. It only accepts the compile and debug QOSs.
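For example, a compile or short test can be directed to the frontend partition with an appropriate QOS, while production jobs go to compute. The exact options below are a sketch; adjust them to your own association and QOS access.

```shell
# Short interactive session on the frontend partition under the compile QOS
srun --partition=frontend --qos=compile --ntasks=1 --pty bash

# Batch job on the compute partition (the default partition)
sbatch --partition=compute my_job.sh
```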
The following command displays detailed information of individual nodes
[user@node0 ~]$ scontrol show node=node2
NodeName=node2 Arch=x86_64 CoresPerSocket=28
CPUAlloc=0 CPUTot=112 CPULoad=0.01
AvailableFeatures=2.2GHz
ActiveFeatures=2.2GHz
Gres=(null)
NodeAddr=node2 NodeHostName=node2 Version=20.02.4
OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
RealMemory=192091 AllocMem=0 FreeMem=186447 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2020-09-14T13:04:17 SlurmdStartTime=2020-09-17T08:53:56
CfgTRES=cpu=112,mem=192091M,billing=112
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
It is important to understand the difference among CPU, thread, and core in Slurm.
A core is a physical processing unit -- typically, what the rest of the world refers to as a "CPU" chip is a multi-core processor containing many cores.
A thread is a logical processing unit by which instructions in a computer program are executed.
A core typically has one thread, but if it is hyper-threaded, the OS recognizes it as having two threads.
In the latter case, two programs can therefore be executed simultaneously (from the user's perspective) by one core, although the performance gain won't be 2x.
In Slurm, a CPU is synonymous with a thread.
For example, node1 and node2 have 112 CPUs, but since there are two threads per core, the total number of physical cores is 56.
In the current configuration, CPU allocation is at the socket level. Hyperthreading is enabled, but a task is allocated to a core.
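Under these settings, the CPU count a job is billed for can differ from the task count. The numbers below are illustrative, and `--hint=nomultithread` is a standard Slurm option rather than a site-specific one.

```shell
# Request 4 tasks; with one task per core and two hardware threads
# per core, Slurm accounts 2 "CPUs" (threads) for each task
sbatch --ntasks=4 --partition=compute my_job.sh

# Ask for one thread per core explicitly if your code does not
# benefit from hyperthreading
sbatch --ntasks=4 --hint=nomultithread my_job.sh
```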
To see all of your running/queued jobs
[user@node0 ~]$ squeue -al --me
Sat Sep 19 09:32:14 2020
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
178 compute sleep user RUNNING 0:04 2-00:00:00 1 node1
There is one job named sleep whose JOBID is 178 and which has been RUNNING for 4 sec on the compute partition.
It requested one node, for which node1 has been assigned, with a TIME limit of 2 days.
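Related monitoring and cancellation commands (job ID 178 from the listing above is reused for illustration):

```shell
# Follow the queue, refreshing the listing every 10 seconds
squeue -al --me --iterate=10

# Cancel a specific job ...
scancel 178

# ... or all of your jobs on a given partition
scancel -u $USER --partition=compute
```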
To see more details about the job status,
[user@node0 ~]$ scontrol show job=178
JobId=178 JobName=sleep
UserId=user(2000) GroupId=users(100) MCS_label=N/A
Priority=61697 Nice=0 Account=research QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:30 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2020-09-19T09:32:10 EligibleTime=2020-09-19T09:32:10
AccrueTime=Unknown
StartTime=2020-09-19T09:32:10 EndTime=2020-09-19T09:32:40 Deadline=N/A
PreemptEligibleTime=2020-09-19T15:32:10 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-19T09:32:10
Partition=compute AllocNode:Sid=123.213.22.160:178466
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node1
BatchHost=node1
NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=3430M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1715M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/scratch/user/test
Power=
MailUser=(null) MailType=NONE
This job is associated with the research (bank) account and the normal QOS, and by the time this command was issued, the job had already been COMPLETED.
The working directory of this job was /scratch/user/test.
As alluded earlier, Slurm provides a fine-grained control of how resource limits are set. The resource limits are set by job's Association which is formed by a User Account, a Partition to which users submit their job, a Bank Account to which users charge the usage, and a Quality of Service (QOS) that the job requires. The individual entities -- user account, partition, bank account, and QOS -- can have their own limits. The association determines the maximum limits that a job can request. A job can, and should, request resources smaller than, or equal to, the limits set by job's association; it cannot request resources exceeding those limits.
For example, the following command lists the associations of user
[user@node0 ~]$ sacctmgr show user withass where name=$USER acc=research
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user research None hpc.kyung+ research frontend parent 1 2-00:00:00 debug
user research None hpc.kyung+ research compute 20 3 2-00:00:00 debug,normal normal
This user has two associations, both of which are linked to the research account.
The user can request resources from the frontend and compute partitions.
The resource limits are different depending on the associations.
- For the association with the frontend partition, the user has a fair share inherited from the account, can submit at most one job, can request no more than 2 days of walltime, and can only use the debug QOS.
- For the association with the compute partition, the user has a fair share of 20, can submit at most three jobs, can request no more than 2 days of walltime, and can use the debug and normal QOSs.
The Def Acct and Def QOS fields indicate the values that will be used if none are explicitly specified on job submission.
QOSs can impose their own limits, which take precedence over the association limits.
The following command lists information about the debug and normal QOSs.
[user@node0 ~]$ sacctmgr show qos where names=debug,normal
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
debug 1 00:00:00 cluster DenyOnLimit 1.000000 cpu=16,mem=48+ 02:00:00 1
normal 0 00:00:00 cluster DenyOnLimit 1.000000
Most of the fields of the normal QOS are unspecified, in which case the limits on the association, if any, are imposed.
When jobs under the normal QOS are subject to preemption, jobs that can complete within 6 hours are exempt from preemption.
The default preemption mode is CANCEL AND REQUEUE.
Job preemption, if at all, will be granted only sparingly.
The debug QOS is, as the name implies, for use with code debugging and testing.
It has a slightly higher priority and jobs under this QOS will not be preempted.
To prevent abuse of this QOS, only one job per user is allowed under it, with a maximum CPU count per node of 8 and a maximum walltime of 2 hours.
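A quick interactive debugging session within those limits might be requested as follows; the values here sit at the QOS caps and should be reduced where possible.

```shell
# One interactive job, up to 8 CPUs and 2 hours, under the debug QOS
srun --qos=debug --ntasks=1 --cpus-per-task=8 --time=02:00:00 --pty bash
```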
The most likely reason for your job sitting in the queue for a long time is low scheduling priority.
All pending jobs in the queue are scheduled to run according to their priority.
One can use the sprio command to query all pending jobs' priority.
A job's priority is determined by various factors.
Among them is the FairShare.
One can query this information using sshare.
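For example, with standard sprio/sshare flags:

```shell
# Priority of all pending jobs, broken down by factor (age, fairshare, ...)
sprio -l

# Your usage record and the resulting FairShare factor
sshare -l -U
```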