Research IT

Systems Management Overhaul for High Energy Physics: Research IT’s Role

Many researchers are aware they can include Research Software Engineering support in their grant applications — but perhaps fewer realise that our Research Platforms team can also be included to provide a wide range of services. From bespoke hardware provision to systems management and automation support, the Platforms team has enabled research across diverse areas. That includes collaborations with the National X-ray Computed Tomography Research Facility and, as highlighted here, the High Energy Physics group.


What is the High Energy Physics Group (HEP)?

The High Energy Physics (HEP) group at the University of Manchester is a research team engaged in both experimental and theoretical particle physics. They collaborate on major international projects, particularly the ATLAS and LHCb experiments at the Large Hadron Collider (LHC), which rely on advanced distributed computing infrastructure to support their work.

HEP maintains two key computing facilities:

  • A Tier 3 compute facility for local research use.
  • A Tier 2 distributed computing facility that forms part of the UK GridPP computational grid, which connects to the Worldwide LHC Computing Grid.

Together, these systems make up the Blackett facility, which includes around 500 physical and virtual machines and around 13 petabytes of storage, supporting the vast data and processing needs of high-energy physics research.

The challenges of managing Tier 3 and Tier 2 systems

The Blackett computing team manages these systems. Ongoing improvements, including storage replacement and GridPP development, have been placing additional strain on limited support resources. The main tools used for system management are Foreman for provisioning, Puppet (which will soon transition to OpenVox) for configuration management, and Nagios and Prometheus for monitoring.

While some may doubt that one or two staff can manage systems on this scale, Platforms Engineer Dave Love from Research IT has extensive experience of managing similarly sized compute clusters solo, and knew that the right tools and automation significantly ease the burden of limited staffing.

The solution

The GridPP funding for managing and deploying computing resources included a network monitoring element, and HEP approached the Research IT Platforms team to collaborate on improving systems management tooling and automation. Dave Love was assigned to the work and is helping with general systems management, handling hardware failures, and elements of user support and user management.

Dave and the HEP team:

  • Installed perfSONAR to meet GridPP network connectivity requirements.
  • Upgraded Puppet to the latest supported version — a non-trivial task given the extensive local configuration and lack of backwards compatibility in Puppet.
  • Improved monitoring systems, focusing on the recently deployed Ceph storage, which has introduced fresh challenges (see the sketch after this list).
  • Enabled remote access to the Tier 3 cluster via GlobalProtect — a long-desired enhancement.
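
To give a flavour of the monitoring work, below is a minimal sketch of a Nagios-style health check for a Ceph cluster. It is not the Blackett team’s actual check: it assumes only that the `ceph` command-line tool is installed on the monitoring host and that the invoking user can read the cluster keyring.

```python
#!/usr/bin/env python3
"""Sketch: Nagios-style health check for a Ceph cluster.

Assumes the `ceph` CLI is installed and the invoking user can read
the cluster keyring. Exit codes follow the Nagios plugin convention:
0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
"""
import json
import subprocess
import sys

# Map Ceph's overall status to Nagios exit codes.
EXIT_CODES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}

def main() -> int:
    try:
        # `ceph health --format json` reports the overall status plus
        # any active health checks (e.g. OSD_DOWN, PG_DEGRADED).
        out = subprocess.run(
            ["ceph", "health", "--format", "json"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"CEPH UNKNOWN: could not query cluster: {exc}")
        return 3

    report = json.loads(out)
    status = report.get("status", "UNKNOWN")
    checks = ", ".join(report.get("checks", {})) or "none"
    print(f"CEPH {status}: active checks: {checks}")
    return EXIT_CODES.get(status, 3)

if __name__ == "__main__":
    sys.exit(main())
```

Nagios only needs the exit code and the first line of output, so a plugin along these lines can slot into an existing service definition unchanged.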

A productive collaboration

Whilst providing some knowledge and techniques to the HEP team, Dave benefited from learning about HEP’s work on automation. IT Services has also benefited, adapting HEP’s work on ambient temperature monitoring to improve resilience in satellite machine rooms and to detect occasional air conditioning failures early. Future HEP work is anticipated to add automated draining and proactive shutdown of compute nodes on such failures, which, as Dave puts it, “the laws of life say happen out of work hours!”. That is a safeguard Dave learnt machine rooms need after a near-catastrophic meltdown in a previous job.
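
To make the draining idea concrete, here is a minimal sketch under stated assumptions: the sensor endpoint, threshold, and node list are hypothetical placeholders, and a Slurm-managed cluster is assumed (the batch system actually used at Blackett may differ). It polls a room sensor and, on an alarm, drains the affected nodes so running jobs can finish while no new work is scheduled.

```python
#!/usr/bin/env python3
"""Sketch: drain compute nodes when a machine room overheats.

Illustrative only. SENSOR_URL, THRESHOLD_C and NODES are hypothetical,
and a Slurm batch system is assumed; adapt to the scheduler in use.
"""
import subprocess
import urllib.request

SENSOR_URL = "http://sensor.example.org/temperature"  # hypothetical exporter
THRESHOLD_C = 35.0                                    # alarm threshold, Celsius
NODES = "node[001-064]"                               # Slurm hostlist for the room

def room_temperature() -> float:
    """Read the ambient temperature from the (hypothetical) sensor."""
    with urllib.request.urlopen(SENSOR_URL, timeout=10) as resp:
        return float(resp.read().decode().strip())

def drain(nodes: str, reason: str) -> None:
    """Mark nodes DRAIN: running jobs finish, new jobs are kept off.

    A follow-up step could power nodes down once they fall idle.
    """
    subprocess.run(
        ["scontrol", "update", f"NodeName={nodes}",
         "State=DRAIN", f"Reason={reason}"],
        check=True,
    )

if __name__ == "__main__":
    temp = room_temperature()
    if temp > THRESHOLD_C:
        drain(NODES, f"suspected AC failure: room at {temp:.1f}C")
```

In practice this logic would more likely live in the monitoring stack itself, for example as an alert whose handler performs the drain, rather than as a standalone poller.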

The collaboration continues, including evaluation of next-generation tools as possible replacements for Nagios and Prometheus. Additionally, Dave acknowledges the excellent support of the University’s Datacentre team for hardware maintenance.

Research IT supporting your research

Research IT is ready to support your research with a wide array of services, including (but not limited to) advanced systems management, tailored infrastructure solutions, research software engineering, data storage and analysis, training, and off-the-shelf research software.

Engage the team via Connect or via email; we will be delighted to hear from you.