I spent a little time today thinking about how to include more than one aprun command in a PBS batch script using python, and I thought I'd write down what I figured out, both for my own sake and in case anyone else is wondering. It's also an excellent opportunity to give a little insight into the glamorous life of a computational scientist.
Beginning with the beginning, a PBS batch script is a script you use to start a job on a cluster or supercomputer that uses the PBS batch scheduler. Essentially, a supercomputer is just a bunch of more or less normal computers, called nodes, hooked up with really, really fast network connections, so in principle you could just log in to one or more of those machines and start your job manually. However, this would be anarchy, and users who wanted large portions of the machine for really large jobs might never find enough free nodes. Thus, someone invented the queue system. Though in reality, it didn't happen in that order at all. Single-node shared machines were the way things worked in the old days, so queue systems are older than what we today think of as a supercomputer. According to Wikipedia, they're known as batch systems because they processed batches of punchcards.
In any case, on a supercomputer there is a batch system running. When you want to start a job, you create a job script, and submit this script to the queue system. I don't know if this is generally true, but at least in the case of PBS, job scripts can be written in any interpreted language that uses # to begin a comment. While they're typically written as shell scripts, it's sometimes more convenient to use python (or perl, I guess) if you want more advanced logic.
The job script contains information about how many nodes and cores you need, as well as how long your job will take, and the batch system uses this information to allocate time and resources in some more or less fair manner, while at the same time aiming to keep utilisation of the machine as close to 100% as possible. In addition to information about time and resources, the job script needs to contain the command or commands necessary for starting your job, and that's the topic of today's article. I'll just dive in, and present an example job script.
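#!/usr/bin/env python
#PBS -N MassiveCalculation
#PBS -l mppwidth=128
#PBS -l walltime=3:00:00
#PBS -V
#PBS -A budget

import os
import subprocess
# Move to the directory the job was submitted from
os.chdir(os.environ['PBS_O_WORKDIR'])

command = 'aprun -n 32 -N 32 executable %s'
parameters = [1, 2, 3, 4]
processes = []
for parameter in parameters:
    folder = 'parameter_%s' % (parameter)
    os.system('mkdir %s' % folder)
    os.chdir(folder)
    print('Running job in %s' % os.getcwd())
    p = subprocess.Popen(command % parameter, shell = True)
    processes.append(p)
    os.chdir('..')
for process in processes:
    process.wait()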
The first line is known as the hashbang, and its purpose is to tell the operating system that this is indeed a python script and should be interpreted by the python interpreter.
Next up are five PBS directives. These are not a part of python, and they are the reason you have to use a language where # signifies a comment. What they do, in order, is set the name of the job to MassiveCalculation, request 128 cores, set the maximum time to three hours, import the environment from the user who submitted the job, and finally charge the job to the budget named budget. There are others as well, and which ones you need will depend on the local settings.
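To make the "directives are just comments" point concrete, here is a toy sketch that pulls the #PBS lines out of a script's text. This is not how PBS itself is implemented. The function name pbs_directives is my own invention, and I'm assuming that directives only count when they appear before the first real statement.

```python
def pbs_directives(script_text):
    """Return the option strings of the #PBS lines at the top of a job script."""
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#PBS"):
            # Keep everything after the "#PBS" marker, e.g. "-N MassiveCalculation"
            directives.append(stripped[len("#PBS"):].strip())
        elif stripped and not stripped.startswith("#"):
            # Assumption: stop scanning at the first line that is real code
            break
    return directives


example = """#!/usr/bin/env python
#PBS -N MassiveCalculation
#PBS -l mppwidth=128
#PBS -l walltime=3:00:00
#PBS -A budget
import os
#PBS -N ignored
"""

print(pbs_directives(example))
# → ['-N MassiveCalculation', '-l mppwidth=128', '-l walltime=3:00:00', '-A budget']
```

To python, every one of those lines is an ordinary comment, which is why the same file can serve as both a job description and a program.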
Next, we import two python modules: os, which allows us to talk to the operating system, and subprocess, which allows us to start subprocesses from within the python script.
When the batch system actually gets around to running this script, the current directory is no longer going to be the one you were in when you submitted the job (at least not in general), so we use the os module to move to that directory. The path is available in the handy environment variable PBS_O_WORKDIR, which we can also access via the os module.
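If you want to be able to try the script outside the batch system, where PBS_O_WORKDIR is not set, one option is to fall back to the current directory. This is a small sketch of that idea. The helper name submission_dir is hypothetical, and the fallback is my own addition, not something PBS provides.

```python
import os


def submission_dir(environ=None):
    """Return the directory the job was submitted from.

    Falls back to the current directory when PBS_O_WORKDIR is not set,
    for example when testing the script on a login node.
    """
    if environ is None:
        environ = os.environ
    return environ.get("PBS_O_WORKDIR", os.getcwd())


# Move to the submission directory (or stay put when run outside PBS)
os.chdir(submission_dir())
```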
Moving on, we define a string which holds the command that will actually start our program, named executable. aprun is the interface responsible for starting parallel jobs, and here we tell it that we want 32 MPI tasks in total (-n 32), with 32 tasks per node (-N 32). In my case, I'm running this on HECToR, which has 32 cores per node, so I want one full node, with one MPI task per core. The final %s is a placeholder for the argument we will pass to the executable later on. I also create an empty list, which will hold references to the subprocesses we are going to start.
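Here is a quick sketch of how the placeholder gets filled in. The floating point parameter values and the two-decimal formatting are made up for illustration; only the command string itself comes from the job script.

```python
command = 'aprun -n 32 -N 32 executable %s'

# Hypothetical floating point parameters, formatted with two decimals
parameters = [0.1 * i for i in range(1, 5)]
commands = [command % ('%.2f' % parameter) for parameter in parameters]

for c in commands:
    print(c)
# → aprun -n 32 -N 32 executable 0.10
# → aprun -n 32 -N 32 executable 0.20
# → aprun -n 32 -N 32 executable 0.30
# → aprun -n 32 -N 32 executable 0.40
```

Formatting the number explicitly avoids surprises like 0.30000000000000004 ending up on the command line.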
Now we are getting to the reason I'm using python for this, which is that I want to loop over a set of parameters, starting one job for each parameter. Of course, in this particular case, the parameters are just a list of integers, and bash could have handled that equally well, but in my real problem I'm doing some floating point arithmetic and other things I don't know how to do in bash. So for each parameter in the list named parameters, I create a folder named parameter_1, etc., move into that folder, print a message saying I'm about to run the job in the current folder, and then I run the job as a subprocess. The shell = True argument means the command is run just as it would have been if I had typed it into a shell. I then add a reference to the subprocess to the list of processes, move back to the original directory, and repeat for the other parameters.
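A variation on the same loop, in case you'd rather not move in and out of folders: Popen takes a cwd argument that sets the working directory of the child process only. The sketch below uses a harmless stand-in command instead of aprun so that it runs anywhere. The stand-in, the temporary base folder and the done.txt file are all assumptions for the sake of the example.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the aprun command: writes its argument to done.txt
# in whatever directory it is started in.
command = [
    sys.executable, "-c",
    "import sys; f = open('done.txt', 'w'); f.write(sys.argv[1]); f.close()",
]
parameters = [1, 2, 3, 4]

base = tempfile.mkdtemp()  # scratch area, so the sketch leaves your files alone
processes = []
for parameter in parameters:
    folder = os.path.join(base, "parameter_%s" % parameter)
    os.makedirs(folder)
    # cwd=folder starts the child in that folder; the script itself never moves
    p = subprocess.Popen(command + [str(parameter)], cwd=folder)
    processes.append(p)

# Wait for all children before the script exits
for process in processes:
    process.wait()
```

Passing the command as a list also means shell = True is not needed, at the cost of losing shell features like wildcards and redirection.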
Finally, we need to wait for all the processes to finish before exiting, otherwise the jobs will be killed when the batch script is done. I do this by looping over the list of subprocesses, and calling wait() on each one, which blocks until that job is done (and returns immediately if it has already finished). When all the jobs are done, so is the batch script, and everything is joyous bliss.
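Since wait() also returns the exit code of the process, the same loop can tell you whether any of the jobs failed. A minimal sketch, again with stand-in python processes instead of aprun, and with made-up exit codes:

```python
import subprocess
import sys

# Stand-in jobs that simply exit with the given (made-up) codes
exit_codes = [0, 0, 3]
processes = [
    subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(%d)" % code])
    for code in exit_codes
]

# wait() blocks until each process is done and returns its exit code
return_codes = [process.wait() for process in processes]
failed = [index for index, code in enumerate(return_codes) if code != 0]

print("Return codes: %s" % return_codes)
print("Failed jobs: %s" % failed)
```

Whether a nonzero code from aprun actually means your calculation failed depends on the executable, so treat this as a hint rather than a verdict.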
So why am I doing this? Couldn't I just have started these four jobs separately, with four separate batch scripts? Well, yes, and that would actually often be a smarter thing to do. When you run this script, you will be charged for 128 cores for as long as the job script is running, which means that if one of your jobs takes significantly longer than the others, you are getting charged for time you don't use. In my case, however, I know that all my jobs take almost exactly the same amount of time. The reason I'm doing this, though, is that there are separate queues on HECToR, based on runtime and number of nodes, and the queue for the smallest jobs (four nodes or fewer) is typically the busiest one. Each queue has a certain maximum number of jobs that can be running at the same time, and each user has a limit on the number of jobs they can have in the queue at the same time, so by combining my jobs I can move up into a less busy queue for jobs that demand more nodes, and thus get more efficient turnaround.
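To put some numbers on the charging argument, here is a small worked example. It assumes the charge is simply the number of cores times the wall-clock time of the whole batch script, as described above. The runtimes are made up, and the function names are my own.

```python
def charged_core_hours(total_cores, runtimes_hours):
    """Core hours charged when all jobs share one batch script.

    The script runs until the slowest job finishes, and all cores
    are charged for that whole time.
    """
    return total_cores * max(runtimes_hours)


def used_core_hours(cores_per_job, runtimes_hours):
    """Core hours the jobs actually keep busy."""
    return cores_per_job * sum(runtimes_hours)


# Four jobs of 32 cores each; one of them takes an hour longer (made-up numbers)
runtimes = [2.0, 2.0, 2.0, 3.0]
charged = charged_core_hours(128, runtimes)
used = used_core_hours(32, runtimes)

print("Charged: %.0f core hours" % charged)       # → Charged: 384 core hours
print("Used:    %.0f core hours" % used)          # → Used:    288 core hours
print("Idle:    %.0f core hours" % (charged - used))  # → Idle:    96 core hours
```

With equal runtimes the idle part drops to zero, which is why combining jobs only makes sense when they take about the same time.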
There are of course loads of ways to do what I'm doing, and I'm not sure if this is the best or most elegant way, but it seems to work for me.