Using resources effectively

Overview

Teaching: 15 min
Exercises: 10 min

Questions

How do we monitor our jobs?

How can I get my jobs scheduled more easily?

Objectives

Understand how to look up job statistics and profile code.

Understand job size implications.

We now know virtually everything we need to know about getting stuff on a cluster. We can log on, submit different types of jobs, use pre-installed software, and install and use software of our own. What we need to do now is use the systems effectively.

Estimating required resources using the scheduler

Although we covered requesting resources from the scheduler earlier, how do we know what type of resources the software will need in the first place, and the extent of its demand for each?

Unless the developers or prior users have provided some idea, we don’t. Not until we’ve tried it ourselves at least once. We’ll need to benchmark our job and experiment with it before we know how great its demand for system resources is.

Read the documentation

Most HPC facilities maintain documentation as a wiki, website, or a document sent along when you register for an account. Take a look at these resources, and search for the software of interest: somebody might have written up guidance for getting the most out of it.

The most effective way of figuring out the resources a job needs to run successfully is to submit a test job, and then ask the scheduler about its impact using sacct.

[nsid@platolgn01 ~]$ sacct

This shows all the jobs we ran recently (note that there are multiple entries per job). To get information about a specific job, we change the command slightly.

[nsid@platolgn01 ~]$ sacct -j 727107

This shows a lot of information, even more if you use the long display option, -l. Plato also has a convenience command, seff, that gives a terser output.

[nsid@platolgn01 ~]$ seff 727107

Job ID: 727107
Cluster: plato
User/Group: nsid/nsid
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:04
CPU Efficiency: 25.00% of 00:00:16 core-walltime
Job Wall-clock time: 00:00:04
Memory Utilized: 1.12 MB
Memory Efficiency: 0.05% of 2.00 GB
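
The CPU Efficiency line can be checked by hand, which makes seff reports for other jobs easier to read. The sketch below assumes the usual definition (CPU time used divided by cores multiplied by wall-clock time) and uses the numbers from the report above; the variable names are made up for illustration.

```bash
# Values copied from the seff report above (variable names are illustrative).
cpu_seconds_used=4      # CPU Utilized: 00:00:04
wall_seconds=4          # Job Wall-clock time: 00:00:04
cores=4                 # Cores per node: 4 (on 1 node)

# Core-walltime is what was reserved: cores multiplied by wall-clock time.
core_walltime=$(( cores * wall_seconds ))

# Efficiency is the CPU time actually used, as a percentage of core-walltime.
efficiency=$(awk -v used="$cpu_seconds_used" -v total="$core_walltime" \
    'BEGIN { printf "%.2f", 100 * used / total }')

echo "core-walltime: ${core_walltime} s"
echo "CPU efficiency: ${efficiency}%"
```

An efficiency of 25% on four cores suggests that only about one core was kept busy, so a smaller core request would probably serve this job just as well.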

Some interesting fields in the longer sacct output include the following:

Hostname: Where did your job run?
MaxRSS: What was the maximum amount of memory used?
Elapsed: How long did the job take?
State: What is the job currently doing/what happened to it?
MaxDiskRead: Amount of data read from disk.
MaxDiskWrite: Amount of data written to disk.

You can use this knowledge to set up the next job with a close estimate of its load on the system. A good general rule is to ask the scheduler for 20% to 30% more time and memory than you expect the job to need. This ensures that minor fluctuations in run time or memory use will not result in your job being cancelled by the scheduler. Keep in mind that if you ask for too much, your job may not run even though enough resources are available, because the scheduler will be waiting to match what you asked for.
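
As a rough sketch of that rule of thumb, the snippet below pads a measured run time and memory figure by 25%. The measurements are hypothetical placeholders, the helper name pad is made up, and the printed lines assume a Slurm batch script that uses the --time (in minutes) and --mem options.

```bash
# Hypothetical measurements from a test job (replace with your own numbers).
measured_minutes=40     # elapsed time of the test job
measured_mem_mb=1500    # peak memory of the test job, in MB
margin_percent=25       # safety margin, within the 20% to 30% rule of thumb

# Integer arithmetic that rounds up, so the request never falls below the target.
pad() { echo $(( ($1 * (100 + $2) + 99) / 100 )); }

request_minutes=$(pad "$measured_minutes" "$margin_percent")
request_mem_mb=$(pad "$measured_mem_mb" "$margin_percent")

# Print the directives that could be pasted into a job script.
echo "#SBATCH --time=${request_minutes}"
echo "#SBATCH --mem=${request_mem_mb}M"
```

With these placeholder numbers the request comes out at 50 minutes and 1875 MB.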

Measuring the statistics of currently running tasks

Connecting to Nodes

Typically, clusters allow users to connect to compute nodes where they have running jobs. This is useful to check on a running job and see how it’s doing. To know which node to connect to, use squeue. Then, run ssh nodename. Once you are on the node of interest, use programs such as top or ps, as described below.

Monitor system processes with `top`

The most reliable way to check current system stats is with top. Some sample output might look like the following (type q to exit top):

[nsid@platolgn01 ~]$ top

top - 21:00:19 up  3:07,  1 user,  load average: 1.06, 1.05, 0.96
Tasks: 311 total,   1 running, 222 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.2 us,  3.2 sy,  0.0 ni, 89.0 id,  0.0 wa,  0.2 hi,  0.2 si,  0.0 st
KiB Mem : 16303428 total,  8454704 free,  3194668 used,  4654056 buff/cache
KiB Swap:  8220668 total,  8220668 free,        0 used. 11628168 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1693 jeff      20   0 4270580 346944 171372 S  29.8  2.1   9:31.89 gnome-shell
 3140 jeff      20   0 3142044 928972 389716 S  27.5  5.7  13:30.29 Web Content
 3057 jeff      20   0 3115900 521368 231288 S  18.9  3.2  10:27.71 firefox
 6007 jeff      20   0  813992 112336  75592 S   4.3  0.7   0:28.25 tilix
 1742 jeff      20   0  975080 164508 130624 S   2.0  1.0   3:29.83 Xwayland
    1 root      20   0  230484  11924   7544 S   0.3  0.1   0:06.08 systemd
   68 root      20   0       0      0      0 I   0.3  0.0   0:01.25 kworker/4:1
 2913 jeff      20   0  965620  47892  37432 S   0.3  0.3   0:11.76 code
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd

Overview of the most important fields:

PID: What is the numerical id of each process?
USER: Who started the process?
RES: What is the amount of memory currently being used by a process (in KiB by default)?
%CPU: How much of a CPU is each process using? Values higher than 100 percent indicate that a process is using more than one core.
%MEM: What percent of system memory is a process using?
TIME+: How much CPU time has a process used so far? Processes using 2 CPUs accumulate time at twice the normal rate.
COMMAND: What command was used to launch a process?
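
To see how RES and %MEM relate, the following sketch recomputes %MEM for the Web Content process from the sample output above. It assumes top's default units, where RES and the memory summary line are both reported in KiB; the variable names are illustrative.

```bash
# Numbers copied from the sample top output above.
total_kib=16303428      # "KiB Mem : 16303428 total"
res_kib=928972          # RES column for the "Web Content" process

# %MEM is resident memory as a percentage of total physical memory.
mem_percent=$(awk -v res="$res_kib" -v total="$total_kib" \
    'BEGIN { printf "%.1f", 100 * res / total }')

# RES is often easier to read in MiB (1 MiB = 1024 KiB); this rounds down.
res_mib=$(( res_kib / 1024 ))

echo "Web Content: ${res_mib} MiB resident, ${mem_percent}% of memory"
```

The computed percentage agrees with the 5.7 shown in the %MEM column for that process.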

htop is a curses-based alternative to top that presents a better-organized and “prettier” dashboard in your terminal.

`ps`

To show all processes from your current session, type ps.

[nsid@platolgn01 ~]$ ps

  PID TTY          TIME CMD
15113 pts/5    00:00:00 bash
15218 pts/5    00:00:00 ps

Note that this will only show processes from our current session. To show all processes you own (regardless of whether they are part of your current session or not), you can use ps ux.

[nsid@platolgn01 ~]$ ps ux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nsid     67780  0.0  0.0 149140  1724 pts/81   R+   13:51   0:00 ps ux
nsid     73083  0.0  0.0 142392  2136 ?        S    12:50   0:00 sshd: nsid@pts/81
nsid     73087  0.0  0.0 114636  3312 pts/81   Ss   12:50   0:00 -bash

This is useful for identifying which processes are doing what.
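
The ps ux listing can also be summarised with standard tools. As a minimal sketch, the hypothetical helper total_rss_kib below adds up the RSS column (the sixth field, which ps reports in KiB), demonstrated here on the sample output from this lesson rather than on live processes.

```bash
# Sum the RSS column (6th field, in KiB) of `ps ux`-style output read from stdin.
# The first line is the header, so it is skipped.
total_rss_kib() {
    awk 'NR > 1 { sum += $6 } END { print sum + 0 }'
}

# Sample output copied from the lesson text above.
sample_ps_output='USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nsid     67780  0.0  0.0 149140  1724 pts/81   R+   13:51   0:00 ps ux
nsid     73083  0.0  0.0 142392  2136 ?        S    12:50   0:00 sshd: nsid@pts/81
nsid     73087  0.0  0.0 114636  3312 pts/81   Ss   12:50   0:00 -bash'

sample_total=$(printf '%s\n' "$sample_ps_output" | total_rss_kib)
echo "Total resident memory: ${sample_total} KiB"
```

On a node where your job is running, the same helper could be fed live data with ps ux | total_rss_kib.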

Key Points

The smaller your job, the faster it will schedule.
