CSC Getting Started Guide

Helpful PBS Hints

When writing PBS batch scripts, there are two important concepts to keep in mind: first, select the correct queue for your job; second, match your PBS resource request to the actual characteristics of your program.

Selecting the Correct PBS Queue

All of the cluster systems have both general purpose and special project queues. The general purpose queues, which all users can submit to, are speedq, friendlyq, and workq. Special project queues, such as reservedq and expressq, are controlled by system administrators. The queue configuration varies from system to system and changes frequently, so use the following command to view the queues available on a given system:

qstat -q

In general, the queues are prioritized to run jobs from special project queues first, followed by jobs from the general queues. User-submitted jobs in speedq have the highest priority, followed by friendlyq, and then by workq. To prevent any single user from monopolizing the system, the higher priority queues impose resource restrictions, only accept jobs of a certain size, and limit the number of concurrently executing jobs. When submitting a job to PBS, it is important to select the queue most appropriate for your job.
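
For example, to submit a job script to a particular queue, pass the queue name to qsub with the -q option (job.pbs here stands in for your own script):

qsub -q friendlyq job.pbs

The queue can also be named inside the script itself with a #PBS -q directive, as shown in the examples below.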

The queue speedq is intended for debugging parallel programs, such as during the annual High Performance Scientific Computing class. Because it’s painful to have to wait in the queues when debugging code, speedq has a very high priority. However, speedq accepts only jobs that run on fewer than 8 nodes in less than 10 minutes. This allows a high throughput of small jobs with low wait time.
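
As an illustrative sketch, a debugging job that fits within the speedq limits might request 4 nodes and a few minutes of walltime (the exact values here are assumptions, not requirements):

#PBS -q speedq
#PBS -l nodes=4:ppn=2,ncpus=8,walltime=00:05:00
...
mpirun -np 8 ./program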

The queue friendlyq is intended for general purpose computing. To be runnable through friendlyq, a job must use fewer than 16 nodes and complete in less than 24 hours. In addition, the number of single-processor jobs allowed for each user is dynamically throttled to divide the cluster among the currently active users.
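
A friendlyq request follows the same pattern, staying within the 16-node and 24-hour limits (again, the values are only illustrative):

#PBS -q friendlyq
#PBS -l nodes=8:ppn=2,ncpus=16,walltime=12:00:00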

Jobs that don’t fit in speedq or friendlyq must be placed in the default queue, workq. Within workq, large parallel jobs have the highest priority, while jobs with long walltimes have the lowest priority. Each user is allowed to have only one or two jobs executing from workq at any given time.
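
Because workq is the default, no -q option is needed when submitting to it, and qstat can confirm where a job landed:

qsub job.pbs
qstat -u $USER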

For users with large parallel jobs, workq is the ideal queue because large jobs have priority there and the cluster may only be able to run one or two such jobs at a time anyway. However, if your job does not fit in friendlyq only because of large walltime requirements, then it’s not a friendly job. If your job cannot run in less than 24 hours, consider implementing a checkpoint and restart feature! Some supercomputing centers, such as NCAR, limit all jobs to 6 hours with no exceptions.
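
A checkpoint and restart scheme can be quite simple. The C sketch below periodically saves the program state to disk and resumes from the last checkpoint on startup; the file name, state structure, and step counts are all assumptions made for illustration:

#include <stdio.h>

struct state { long step; double value; };

/* Try to resume from a previous checkpoint; returns 1 on success. */
static int try_restore(struct state *s) {
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return 0;
    int ok = fread(s, sizeof *s, 1, f) == 1;
    fclose(f);
    return ok;
}

/* Save the current state, writing to a temporary file first so a
   job killed mid-write never leaves a corrupt checkpoint behind. */
static void save(const struct state *s) {
    FILE *f = fopen("checkpoint.tmp", "wb");
    if (!f) return;
    fwrite(s, sizeof *s, 1, f);
    fclose(f);
    rename("checkpoint.tmp", "checkpoint.dat");
}

int main(void) {
    struct state s = { 0, 0.0 };
    try_restore(&s);                      /* resume if a checkpoint exists */
    for (; s.step < 1000000; s.step++) {
        s.value += 1.0;                   /* stand-in for real computation */
        if (s.step % 10000 == 0)
            save(&s);                     /* checkpoint periodically */
    }
    save(&s);
    return 0;
}

With this structure, a job that hits its walltime limit can simply be resubmitted and will continue from the last completed save rather than from step zero.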

Requesting Resources

The most important concept to remember when working with PBS and MPI parallel programs is that PBS provides resources, but mpirun configures how the resources are used.

In almost all cases, the best configuration is a 1:1 mapping between MPI tasks and real processors. For example, if you want 20 MPI processes, then you should request 20 CPUs on 10 nodes:

#PBS -l nodes=10:ppn=2,ncpus=20
...
mpirun -np 20 ./program

In this case, the #PBS line instructs PBS to allocate 20 CPUs on 10 nodes. After PBS reserves these nodes, mpirun starts the program, placing 20 tasks on the 20 processors, providing the ideal 1:1 mapping of tasks to processors.
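
One way to double-check an allocation from inside the job script is to inspect $PBS_NODEFILE, the file PBS creates listing the hosts assigned to the job. On most PBS installations each allocated processor gets one line, so the line count should match the CPU count (a quick sanity check, assuming that per-processor convention):

wc -l $PBS_NODEFILE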

When performing timing measurements, achieving a 1:1 task to processor mapping is essential. However, it is possible to request fewer processors than are really required. This is known as oversubscribing. For example, the 20-task program could be run on 5 nodes:

#PBS -l nodes=5:ppn=2,ncpus=10
...
mpirun -np 20 ./program

This PBS request allocates 5 nodes and 10 CPUs. The mpirun command, however, still starts 20 tasks. Because there are more tasks than processors, mpirun is forced to place 2 tasks on every processor. The node operating system then timeshares each processor between its two tasks, giving each task 50% of the available CPU time.

While oversubscribing may seem like a useful technique for getting jobs running on a heavily utilized system, jobs that require more than a few minutes to execute should not be run in an oversubscribed fashion: they simply take twice as long as they would with the correct processor allocation.

In the opposite direction, it is possible to undersubscribe nodes. In this case, a job uses only one of the two CPUs on each node for processing, leaving the other CPU idle. There are several reasons to do this. For example, some jobs require more memory than would be available with two tasks running per node. Similarly, during important timing runs, reserving both processors in each node prevents other people from using the second CPU.

To undersubscribe a node, simply request one node (and both CPUs) for each process:

#PBS -l nodes=20:ppn=2,ncpus=40
...
mpirun -np 20 ./program

In this example, PBS reserves 20 nodes and both CPUs per node. The mpirun command starts only 20 tasks, so one task is started on each node. The other processor in each node is left idle.
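
Note that one-task-per-node placement depends on how your mpirun assigns tasks to hosts. If tasks end up doubled on some nodes, MPICH-style mpirun implementations accept a machine file, so one workaround (a sketch; details vary by MPI implementation) is to hand mpirun a host list with each node appearing once:

sort -u $PBS_NODEFILE > nodes.once
mpirun -np 20 -machinefile nodes.once ./program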

In most cases, a 1:1 task to processor mapping works best. When adjusting your mpirun command lines, make sure the PBS resource request still matches your intent.
