Addressing Critical Security Issues in Compute Systems

The grid systems and security

A previous item covered Dave Love’s system management work with the High Energy Physics computing group. In this article Dave reports on the team’s response to critical security vulnerabilities, which might be of interest to others running multi-access compute systems. It assumes some familiarity with Linux (the kernel).

To recap, Manchester HEP run a ‘Tier 2 node’ — the most productive in the UK! — of the HEP world-wide computing ‘grid’. Users submit compute jobs which might end up at Manchester from sites around the world. The site can’t easily be targeted, as jobs mostly follow where data they need are distributed, but it’s still a significant risk overall. While we hope real users are responsible types, with their research/PhD at risk, accounts compromised by stolen credentials are a concern. Jobs have no more restriction than a logged-in user on such a ‘multi-access’ system, and run under a generic account per experiment, so there’s no direct way to pin them to a specific user.

Security vulnerabilities are critical if they provide Local Privilege Escalation (LPE), where a normal user can somehow become root because of bugs, at least if there’s a known ‘proof of concept’ (‘PoC’), which is effectively an exploit. The grid security people require fixes for critical vulnerabilities within a number of days but, of course, we want to address them as quickly as we can.

Tackling a nasty vulnerability, then a spate!

So, it was unpleasant to see an advisory on a Thursday night about the Copy-fail LPE vulnerability. The rest of the team might have got more warning before a grid security team advisory arrived, but the forwarded original advisory was bounced by the email system as ‘contains malware’. The moral: don’t email such things!

Copy-fail was particularly problematic on our Alma Enterprise Linux systems derived from Red Hat’s RHEL. EL has the kernel driver module at issue built in, not loaded on demand as in other distributions, so we couldn’t mitigate the vulnerability by preventing it loading.
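Whether a driver is built in or loadable can be checked against the kernel’s modules.builtin list. The following is a minimal Python sketch, assuming the usual /lib/modules/<release>/modules.builtin layout; the function names are ours, for illustration, and no particular driver name is implied.

```python
from __future__ import annotations

import os
from pathlib import Path


def builtin_modules(listing: str) -> set[str]:
    """Return module names from the text of a modules.builtin file.

    Each line is a path such as 'kernel/fs/ext4/ext4.ko'; the module
    name is the file name without '.ko', with '-' normalised to '_'.
    """
    names = set()
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        stem = line.rsplit("/", 1)[-1]
        if stem.endswith(".ko"):
            stem = stem[:-3]
        names.add(stem.replace("-", "_"))
    return names


def is_builtin(module: str, release: str | None = None) -> bool:
    """True if `module` is compiled into the kernel for `release`.

    Defaults to the running kernel. A built-in driver cannot be kept
    out by modprobe rules, which is the difficulty described above.
    """
    release = release or os.uname().release
    path = Path("/lib/modules") / release / "modules.builtin"
    return module.replace("-", "_") in builtin_modules(path.read_text())
```

If is_builtin() returns True for the driver named in an advisory, blocking the module load is not an option and some other mitigation is needed.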

The suggested mitigation was to add a kernel parameter and reboot. However, we didn’t want to declare down time to block new, possibly malicious, jobs and drain the cluster, let alone kill running jobs as some did. Although we can do rolling updates, it’s relatively painful and may take days to drain some nodes, so we lose significant compute time.

So, what to do? Blocking the module in a running kernel suggests using eBPF. Thinking someone might have got there already, we found a suitable mitigation. After testing, it went into the Puppet configuration, and all the compute nodes were protected around two hours after we started discussing it, without down time, and the local ‘tier 3’ cluster got the same fix automatically.

Somewhat to our surprise, there was no signature of the exploit in our central logging, so we didn’t need to rebuild nodes, and there has been no sign of it since.

Subsequently, several LPE variations on the copy-fail mechanism, and unrelated LPEs, were announced late-ish on successive Thursdays, and the first Friday with nothing to do was a nice surprise! Fortunately those issues were mitigated by blocking unused kernel modules, except for one that needed ptrace temporarily disabled, in the hope that jobs don’t need the debugger.
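As an illustration of blocking unused modules, here is a small Python sketch that renders modprobe.d(5) rules of the ‘install <name> /bin/false’ form, which make modprobe run a failing command instead of loading the module. This is an assumption about one common approach, not necessarily the rules we deployed, and the module names in the usage line are placeholders. Such rules only affect loads made through modprobe (including kernel auto-loading), not a direct insmod by root.

```python
def modprobe_block_rules(modules):
    """Render modprobe.d(5) lines that make loading each module fail.

    'install <name> /bin/false' tells modprobe to run /bin/false in
    place of inserting the module, so the load attempt fails.
    Duplicates are dropped and names are sorted for stable output.
    """
    lines = ["# Block kernel modules we never expect to use"]
    for name in sorted(set(modules)):
        lines.append(f"install {name} /bin/false")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder names; substitute those listed in the advisory.
    print(modprobe_block_rules(["example_proto", "example_drv"]), end="")
```

The output would go in a file under /etc/modprobe.d, for instance via the configuration management system.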

Later we updated to a kernel fixing most of the LPEs, and blocked the remaining one, labelled CVE-2026-46243, which is fixed by a later kernel. CVE-2026-46243 would have been a problem, for which we’d have needed to seek a workaround, if we had Kerberos-authenticated CIFS mounts.

Though it isn’t useful when you want to avoid reboots, Alma have very quickly released kernels in the testing channel with copy-fail and the others fixed, further justifying using Alma.

AI, reducing attack surface, and preparedness

The main reason these vulnerabilities piled up is that they were found by AI, or at least with its aid, especially as variations on a theme. We anticipate plenty more critical ones after the current lull. (The threat from AI-derived exploits is more general, particularly with the ability to chain exploits at a higher level to break into sites.)

As system managers (proactive, not reactive ‘system admins’) we want to forestall similar problems by reducing the ‘attack surface’ on the worker nodes exposed to untrusted jobs, and do anything else we can to counter classes of security issues. Some of that has been in place for a while, like disabling network namespaces, which aren’t needed and are a frequent source of vulnerabilities.
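Hardening settings of this kind are easy to lose track of, so it can help to audit them. The sketch below, in Python, compares sysctl values under /proc/sys with a desired policy. It assumes that network namespaces are restricted with the user.max_net_namespaces sysctl set to 0, which is one way of doing it and not necessarily ours; the policy table and function names are illustrative.

```python
from pathlib import Path

# Hypothetical policy: sysctl name -> value we want to see.
WANTED = {
    "user.max_net_namespaces": "0",  # no new network namespaces
}


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under `root`."""
    return Path(root) / name.replace(".", "/")


def audit(wanted, root="/proc/sys"):
    """Return {name: (wanted, actual)} for every setting that differs.

    `actual` is None when the sysctl file is missing or unreadable.
    """
    problems = {}
    for name, want in wanted.items():
        try:
            actual = sysctl_path(name, root).read_text().strip()
        except OSError:
            actual = None
        if actual != want:
            problems[name] = (want, actual)
    return problems


if __name__ == "__main__":
    for name, (want, actual) in audit(WANTED).items():
        print(f"{name}: wanted {want}, found {actual}")
```

The root argument exists so that the check can be exercised against a scratch directory rather than the live /proc/sys.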

A characteristic of most recent vulnerabilities is that they occur in drivers we wouldn’t expect to use. It’s not tractable to maintain modprobe.d(5) rules under /etc/modprobe.d to block everything we don’t want, as there’s no whitelist facility, though we could follow Ubuntu in at least aliasing out some protocols. [See the man page to avoid confusion; blacklist isn’t usually what you want.] Instead, we’ve identified what we need on worker nodes and now have a one-shot service which prunes loaded modules, adds some that may be needed later due to races in starting services, and blocks further module (un-)loading with the kernel.modules_disabled sysctl. Not ideal, but the best we can do without modifying kmod(8).

We need at least to be more comfortable wielding tools like eBPF in future. Otherwise, we could consider kernel live patches, having had experience with an older implementation, but they need module loading, which we’ve blocked; the same goes for SystemTap on EL systems, and it may not even be possible to build them for particular bugs.

We should consider the operating system’s native security features, principally SELinux, though it isn’t effective against vulnerabilities like copy-fail. We only run packaged system software, which appears fine with SELinux, but there’s still reluctance to enable it after old experience of breakage. Unfortunately, turning off SELinux is typically the first thing people do (or are advised to do) when installing compute clusters, rather than fixing any problems it causes.
