Research IT

Top Tips for the UoM Condor Pool

This latest instalment of our Top Tips blog posts is sure to be of interest to users of our popular Condor Pool. Ian Cottam, Research Software Engineer (RSE) in Research IT, has put together some handy tips for you to get the most out of this great computational resource.

Tip 1. Use macros and command line arguments to make your submit scripts more versatile and to avoid editing them between runs.

Here are two examples. First, if you are using the new feature of cloud bursting from our local Condor Pool to Amazon Web Services (AWS), an extra line is needed in your submit script. Rather than editing every script for jobs that you want to permit to cloud burst, you can pass the line as an argument when you run condor_submit, like so:

    condor_submit -a "+MayUseAWS=True" submit.txt

The -a argument has the same effect as adding the given line just before the Queue line in the submit script. For more information see our webpage on cloud bursting.
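As a rough illustration of that equivalence, here is the edited-script alternative, written as a shell snippet that creates a minimal submit file. The file name submit_aws.txt, the executable myprog and the log file name are made up for this example; only the +MayUseAWS line and its position before Queue come from the tip above.

```shell
# Write a minimal, hypothetical submit file with the cloud bursting line
# placed just before Queue, which is where -a would effectively put it.
# (myprog, myprog.log and submit_aws.txt are placeholder names.)
cat > submit_aws.txt <<'EOF'
Executable = myprog
Log        = myprog.log
+MayUseAWS = True
Queue
EOF
```

Submitting this file with a plain condor_submit submit_aws.txt should then behave like the -a form, at the cost of keeping a separate copy of the script.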

Relatedly, many Condor users edit their scripts to define the number of jobs to queue up, e.g. Queue 1000. Typically you need a small number for testing, say 1 to 5, and then, when all looks good, the full batch. To save edits, put this in your scripts:

    Queue $(qnum)

If qnum is not defined, it defaults to the empty string, and "Queue" is the same as "Queue 1". Again, use the -a argument to condor_submit, e.g.:

    condor_submit -a "qnum=3" submit.txt

    condor_submit -a "qnum=1000" submit.txt

Multiple -a arguments are allowed.
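To show how the two examples can be combined, here is a small sketch of a shell wrapper that passes both macros in one go. The function name submit_n and the DRY_RUN switch are inventions for this post, not part of Condor; the only Condor feature used is the -a argument described above.

```shell
# submit_n QNUM SUBMIT_FILE [aws]
# Hypothetical helper: queue QNUM jobs and optionally allow cloud bursting.
# Set DRY_RUN=1 to print the condor_submit command instead of running it.
submit_n() {
    qnum=$1
    file=$2
    burst=${3:-}

    # Rebuild the argument list, starting with the qnum macro.
    set -- -a "qnum=$qnum"

    # A second -a is added only when cloud bursting is requested.
    if [ "$burst" = "aws" ]; then
        set -- "$@" -a "+MayUseAWS=True"
    fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "condor_submit $* $file"
    else
        condor_submit "$@" "$file"
    fi
}
```

For example, submit_n 3 submit.txt would do a small test run, and submit_n 1000 submit.txt aws would submit the full batch with cloud bursting permitted, assuming the script ends with Queue $(qnum).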

Tip 2. Some Condor Pool users are unsure what the H (Held) state means. Here is what it means and a tip for fixing things.

The Held state means something went wrong after your job was started, preventing Condor from completing it. At the time of writing some 1200 jobs are in this state. You, not the system, have to find and fix the problem. The simplest way is to use the status page, which has recently been restored to health.

If your job or jobs are Held, click on the link for your username and keep clicking through until you reach the job's details. There you will see the reason for the job being Held. A common one is getting the name or letter case of a filename wrong. Remove the Held jobs with condor_rm, fix the problem and simply resubmit.
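If you prefer the command line to the status page, something along these lines may also work. This is only a sketch: it assumes the standard condor_q -hold and condor_rm -constraint options are available on the submit node, and the helper names run, show_held and remove_held are made up for this post.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
# (Quotes around arguments are not shown in the printed form.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the held jobs of the given user, with the reason each was held.
show_held() {
    run condor_q -hold "$1"
}

# Remove your held jobs; in Condor, JobStatus 5 means Held.
remove_held() {
    run condor_rm -constraint 'JobStatus == 5'
}
```

A cautious way to try it is to set DRY_RUN=1 first, check the commands that would be run, and only then run show_held with your username followed by remove_held.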

Although the above tips are written in Condor syntax, something similar can probably be done with other HPC/HTC systems, both here and externally.

If you have any questions about using our Condor Pool then please get in touch.