The Application Support team in Research IT is a dedicated team of research computing specialists who offer support and guidance to researchers to help them set up, debug, optimise and execute their research software (amongst other things!). Recently the team was contacted by a researcher from the Department of Chemistry with an interesting challenge involving the use of GPUs and the accelerating power that they can provide.
What Are GPUs?
Graphics Processing Units (GPUs) are now used for far more than rendering graphics. Containing at least hundreds of times more processing units than a traditional processor, GPUs allow advanced users to perform massively parallel simulations, sometimes reducing the time required to perform analyses by several orders of magnitude.
However, it is not a simple job to move complex (or sometimes even trivial) computations onto GPUs so that they are used properly and efficiently. Even though more and more scientific libraries are developing support for GPUs, this support often comes with a tacit 'buyer beware' status.
The Computational Shared Facility (CSF) is a powerful and constantly updated High Performance Computing (HPC) cluster at the University of Manchester. In addition to a large number of traditional processors, it is also equipped with several modern and powerful GPUs. The CSF supports many libraries with (often nascent) GPU support, but these are not always easy for scientific users to operate, especially on less familiar HPC systems.
In this case, the researcher hoped to perform simulations using the spin dynamics simulation engine Spinach, which now supports the use of GPU processing.
Oliver Woolland (RSE) supported this effort, initially identifying several small issues, but a conflict between the configuration of the CSF and the design of Spinach required more significant thought and investigation.
The Initial Investigation
The initial issues included small misconfigurations of the researcher’s batch submission script. With access granted by the researcher, Oliver was able to debug these scripts and ensure they were working correctly.
Next, an issue was uncovered when the Spinach application complained it was unable to produce any output files. This was because Spinach assumed it could write to the CSF's global 'scratch' directory, an area for short-term storage of code and data, which is regularly cleaned up.
This is a sensible place for an application to write output but, on the CSF, the output should be directed to the user's own personal scratch area, not the global area. Luckily Spinach allows users to customise the location it writes to, and this was easily solved in the researcher's simulation script.
Even with these issues solved, the researcher's script was still unable to run properly. Errors were thrown which said the researcher did not have permission to use the CSF’s GPUs, even though they had the correct accounts and permissions.
Deeper Investigation
To resolve the issue, the design of Spinach had to be understood. Oliver began to read the technical documentation and source code of Spinach to gain an understanding of what the library was doing, and why this might be failing on the CSF.
This investigation produced an interesting result. Significantly, Spinach begins its processing by launching a master thread (a single process), which it launches on a GPU if the user has requested GPU-based execution. To utilise the acceleration and parallelism that GPUs can bring, Spinach's master thread then launches many child threads which perform the work of the simulation.
This design is neat and will work perfectly on most systems, especially those with GPUs that users own and manage themselves. However, it breaks down on a shared system like the CSF.
This approach (master thread launching child threads) conflicts fundamentally with the configuration of the CSF. The CSF's GPUs are set up in “exclusive process” mode, which allows only one process per GPU. When Spinach launches its child threads, exclusive process mode denies the action, and the job fails (hence the permissions errors).
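To see which mode a GPU is in, the compute mode can be read from `nvidia-smi`. The following is a minimal Python sketch rather than anything the CSF or Spinach provides: it assumes `nvidia-smi` is on the path and that it reports exclusive process mode as `Exclusive_Process`; the function names are hypothetical.

```python
import subprocess

# Assumption: the label nvidia-smi prints for exclusive process mode.
EXCLUSIVE = "Exclusive_Process"


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' lines into a {gpu_index: mode} dict."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_compute_modes():
    """Ask nvidia-smi for the compute mode of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_modes(result.stdout)


def exclusive_gpus(modes):
    """Return the indices of GPUs that accept only one process at a time."""
    return sorted(i for i, mode in modes.items() if mode == EXCLUSIVE)


# On a GPU node one might run:
#   print(exclusive_gpus(query_compute_modes()))
```

Any GPU listed by `exclusive_gpus` will refuse a second process, which is the behaviour that stopped Spinach's child threads.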
This may seem restrictive, but this setting was chosen to protect the CSF's users: it ensures only one user is using each GPU at a time, eliminating resource conflicts which might crash researchers' jobs!
To find a solution, Oliver reached out to the creator of Spinach for guidance. They confirmed that Oliver's understanding of how Spinach works was correct and that this approach would conflict with an exclusive process configuration.
The Solution
While finding and identifying the problem took some effort, the ultimate solution was easy to implement, and a workaround for this situation is now published in the CSF's extensive documentation.
The solution was the Nvidia Multi-Process Service (MPS), which allows a process to spawn children without these counting as additional (external) processes, while still retaining the user's exclusive use of the GPU. Implementing this service resolved the issue, and the researcher was then able to run their simulations, now accelerated by the GPUs!
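As a rough illustration of what using MPS inside a job can look like, the Python sketch below starts a per-job MPS control daemon, runs a command, and then shuts the daemon down. It is not the CSF's published workaround, which may differ in its details. It assumes `nvidia-cuda-mps-control` is installed on the GPU node; the function names and the `mps-pipe` and `mps-log` directory names are hypothetical.

```python
import os
import subprocess


def mps_environment(base_dir):
    """Build an environment pointing MPS at per-job pipe and log directories."""
    env = dict(os.environ)
    # Hypothetical directory names, kept inside the job's own working area.
    env["CUDA_MPS_PIPE_DIRECTORY"] = os.path.join(base_dir, "mps-pipe")
    env["CUDA_MPS_LOG_DIRECTORY"] = os.path.join(base_dir, "mps-log")
    return env


def run_with_mps(command, base_dir):
    """Run `command` (a list of strings) with an MPS daemon active."""
    env = mps_environment(base_dir)
    os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
    os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)

    # Start the MPS control daemon in the background.
    subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)
    try:
        # The job and its children reach the GPU through the MPS server.
        return subprocess.run(command, env=env).returncode
    finally:
        # Ask the daemon to shut down, even if the job failed.
        subprocess.run(["nvidia-cuda-mps-control"], input="quit\n",
                       text=True, env=env, check=False)
```

Keeping the pipe and log directories per job means that two jobs on different GPUs do not share a daemon.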
Get in touch!
The Apps team is able to help with a wide variety of issues and problems, so get in touch to discuss your requirements through the dedicated form in the Knowledge Base.