Research associate Mohammed Elasrag, from Prof Bernard Keavney's group (Division of Cardiovascular Sciences), is developing more accurate tools for the interpretation of genetic variants (the differences in DNA sequences between individuals within a population). One aspect of this work involved processing 1 million matrices as 1 million separate R jobs, with the aim of simulating patterns of covariation.
Mohammed initially ran the 1 million R jobs on the University's Computational Shared Facility (CSF) but was not getting the throughput he needed. He estimated that the work would take several months to complete, so he contacted Research IT for advice.
A small team of Research Infrastructure Engineers (RIEs), consisting of Chris Grave, Daniel Corbett and Chris Paul, looked at the issue and suggested using the University’s HTCondor service, where jobs can burst into the Cloud from the Condor pool. Unfortunately, due to the sensitivity of the data, the job was not a suitable candidate for cloud bursting. It was therefore determined that all processing needed to be carried out using HTCondor compute resources located on campus.
The RIEs worked closely with Mohammed to get him up and running with his research. This included helping him to organise the million input files into a sensible directory hierarchy, to avoid putting unnecessary strain on the associated filesystems and the campus network. The job also required several additional R packages, which Research IT installed on the nodes before the job started. Finally, they advised Mohammed on a sensible input script. All of the above helped to maximise job throughput.
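To illustrate why a directory hierarchy helps, here is a minimal Python sketch of one common approach: grouping files into numbered subdirectories so that no single directory holds more than a fixed number of entries. This is not the layout the RIEs actually used. The file naming pattern (input_<n>.<ext>), the bucket size of 1,000 and the function names are all assumptions made for illustration.

```python
from pathlib import Path


def bucket_dir(root: Path, job_id: int, per_dir: int = 1000) -> Path:
    """Return the subdirectory for a job, holding at most per_dir files each.

    Hypothetical scheme: job 123456 with per_dir=1000 goes to root/0123.
    """
    return root / f"{job_id // per_dir:04d}"


def organise(flat_dir: Path, root: Path, per_dir: int = 1000) -> int:
    """Move files named input_<n>.<ext> from flat_dir into bucketed subdirectories.

    Assumes every matching file name carries an integer job index after the
    underscore, and that flat_dir and root are on the same filesystem.
    Returns the number of files moved.
    """
    moved = 0
    # Materialise the listing first so moving files does not disturb iteration.
    for src in sorted(flat_dir.glob("input_*")):
        job_id = int(src.stem.split("_")[1])
        dest_dir = bucket_dir(root, job_id, per_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        src.rename(dest_dir / src.name)
        moved += 1
    return moved
```

With 1 million inputs and 1,000 files per bucket, this would give 1,000 subdirectories of 1,000 files each, rather than a single directory with a million entries that every listing or lookup has to scan.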
The job itself was one of the largest ever run on HTCondor, taking approximately 7 weeks (600,590 CPU hours) to complete. Running it on the CSF would have taken several months longer.
If you think the HTCondor pool would be suitable for your workload or if you would like to find out more, please get in touch!