
Using resources effectively

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • How can I review past jobs?

  • How can I use this knowledge to create a more accurate submission script?

Objectives
  • Look up job statistics.

  • Make more accurate resource requests in job scripts based on data describing past performance.

We’ve touched on all the skills you need to interact with an HPC cluster: logging in over SSH, loading software modules, submitting parallel jobs, and finding the output. Let’s learn about estimating resource usage and why it might matter.

Estimating Required Resources Using the Scheduler

Although we covered requesting resources from the scheduler earlier with the π code, how do we know what kinds of resources the software will need in the first place, and how much of each? In general, unless the software documentation or user testimonials provide some idea, we won’t know how much memory or compute time a program will need.

Read the Documentation

Most HPC facilities maintain documentation as a wiki, a website, or a document sent along when you register for an account. Take a look at these resources, and search for the software you plan to use: somebody might have written up guidance for getting the most out of it.

A convenient way of figuring out the resources required for a job to run successfully is to submit a test job, and then ask the scheduler about its impact using sacct -u yourUsername. You can use this knowledge to set up the next job with a closer estimate of its load on the system. A good general rule is to ask the scheduler for 20% to 30% more time and memory than you expect the job to need. This ensures that minor fluctuations in run time or memory use will not result in your job being cancelled by the scheduler. Keep in mind that if you ask for too much, your job may sit in the queue longer than necessary: even if the resources the job actually needs are free, the scheduler will wait for other people’s jobs to finish before it can assemble the larger allocation you asked for.
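
As a rough worked example (the numbers here are illustrative, not taken from the lesson): if a test job finished in about 40 minutes and used about 3 GB of memory, a 20% to 30% margin suggests requesting roughly 50 minutes and 4 GB next time. In a Slurm submission script, that might look like the following sketch; your site’s scripts will also carry account, partition, and other options, as shown earlier in the lesson.

#!/bin/bash
#SBATCH --job-name=test-analysis   # hypothetical job name
#SBATCH --time=00:50:00            # ~40 min observed, plus ~25% margin
#SBATCH --mem=4G                   # ~3 GB observed, rounded up for headroom

# ... load modules and run the program exactly as before ...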

Stats

Since we already submitted pi.py to run on the cluster, we can query the scheduler to see how long our job took and what resources were used. We will use sacct -u yourUsername to get statistics about parallel-pi.sh.

[yourUsername@cirrus-login1 ~]$ sacct -u yourUsername
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
3938249       serial-pi   standard      tc036          1  COMPLETED      0:0 
3938249.bat+      batch                 tc036          1  COMPLETED      0:0 
3938249.ext+     extern                 tc036          1  COMPLETED      0:0 
3938265       serial-pi   standard      tc036          1  COMPLETED      0:0 
3938265.bat+      batch                 tc036          1  COMPLETED      0:0 
3938265.ext+     extern                 tc036          1  COMPLETED      0:0 
3938266       serial-pi   standard      tc036          1 OUT_OF_ME+    0:125 
3938266.bat+      batch                 tc036          1 OUT_OF_ME+    0:125 
3938266.ext+     extern                 tc036          1 OUT_OF_ME+    0:125 
3939324      parallel-+   standard      tc036          4  COMPLETED      0:0 
3939324.bat+      batch                 tc036          4  COMPLETED      0:0 
3939324.ext+     extern                 tc036          4  COMPLETED      0:0 
3939324.0        python                 tc036          4  COMPLETED      0:0 

This shows all the jobs we ran recently (note that there are multiple entries per job). To get information about a specific job, we change the command slightly.

[yourUsername@cirrus-login1 ~]$ sacct -u yourUsername -l -j 3939324

This will show a lot of information; in fact, every single piece of information the scheduler collected about your job will show up here. It can be more useful to specify just the information we want using the -o or --format option. Use the command sacct --helpformat to get a list of output options.

[yourUsername@cirrus-login1 ~]$ sacct -u yourUsername -j 3939324 -o 'JobID, AllocCPUS,State,ExitCode,Elapsed,ReqMem'
JobID         AllocCPUS      State ExitCode    Elapsed     ReqMem 
------------ ---------- ---------- -------- ---------- ---------- 
3939324               4  COMPLETED      0:0   00:00:08     28600M 
3939324.bat+          4  COMPLETED      0:0   00:00:08            
3939324.ext+          4  COMPLETED      0:0   00:00:08            
3939324.0             4  COMPLETED      0:0   00:00:06 

Discussion

This view can help you compare the amount of time requested with the time actually used, how long the job sat in the queue before launching, and its memory footprint on the compute node(s).

How accurate were our estimates?
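
To answer that, one approach is to pull the requested and consumed values side by side. The sketch below uses standard sacct format fields (Timelimit versus Elapsed for time, ReqMem versus MaxRSS for memory, Submit versus Start for queue wait); the exact fields available can vary between Slurm versions, so check sacct --helpformat on your system. Note that MaxRSS is reported on the job steps (the .batch and .0 lines) rather than on the parent job entry.

[yourUsername@cirrus-login1 ~]$ sacct -u yourUsername -j 3939324 -o 'JobID,Timelimit,Elapsed,ReqMem,MaxRSS,Submit,Start'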

Improving Resource Requests

Using the job history we can give better time estimates for our jobs. When we overestimate the time needed to complete a job, we make it harder for the queuing system to estimate accurately when resources will become free for other jobs. Practically, this means that the queuing system waits to dispatch our job until the full requested time slot opens, instead of “sneaking it into” a much shorter window where the job could actually finish. Specifying the expected runtime in the submission script more accurately will help alleviate cluster congestion and may get your job dispatched earlier.
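
For instance, the formatted sacct output above showed the parallel π job finishing in about 8 seconds. A revised parallel-pi.sh could therefore request a much tighter walltime while still leaving plenty of headroom. This is only a sketch; the walltime your original script requested, and its other directives, may differ.

#SBATCH --time=00:05:00   # still very generous for an ~8-second job, but far easier to schedule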

Key Points

  • Accurate job scripts help the queuing system efficiently allocate shared resources.