From the perspective of a traditional x86 computing enthusiast – or professional – mainframes are strange, archaic beasts. They’re physically massive, power-hungry, and expensive compared to more traditional data center equipment, and they generally offer less compute power per rack at a higher cost.
This begs the question, “Then why keep using mainframes?” Once you wave away the cynical answers that boil down to “because we’ve always done it that way,” the practical answers largely come down to reliability and consistency. As AnandTech’s Ian Cutress points out in a speculative piece focusing on Telum’s redesigned cache, “the downtime of these [IBM Z] systems is measured in milliseconds per year.” (If true, that’s at least seven nines.)
IBM’s own announcement of the Telum shows how different the priorities of mainframe and commodity computing are. It casually describes Telum’s memory interface as “capable of tolerating full channel or DIMM failures and designed to transparently recover data without impacting response time.”
When you pull a DIMM from a live, running x86 server, that server doesn’t “transparently recover data” – it just crashes.
IBM Z Series Architecture
Telum was designed as something of a single chip to rule them all for mainframes, replacing a much more heterogeneous setup in previous IBM mainframes.
The 14 nm IBM z15 CPU that Telum replaces has a total of five processors: two pairs of 12-core compute processors and one system controller. Each Compute Processor hosts 256 MiB L3 cache shared between its 12 cores, while the System Controller hosts a whopping 960 MiB L4 cache shared between the four Compute Processors.
Five of these z15 processors – four Compute Processors and one System Controller – form a “drawer.” Four drawers come together in a single z15-powered mainframe.
While the concept of multiple processors to a drawer and multiple drawers to a system persists, the architecture within Telum itself is radically different – and significantly simplified.
Telum is a bit simpler at first glance than z15 was: it’s an eight-core processor built on Samsung’s 7 nm process, with two processors combined on each package (similar to AMD’s chiplet approach with Ryzen). There is no separate System Controller processor – all Telum processors are identical.
From here, four Telum packages combine to make one four-socket “drawer,” and four of those drawers go into a single mainframe system. This gives a total of 256 cores on 32 chips. Each core runs at a base clock speed above 5 GHz – providing more predictable and consistent latency for real-time transactions than a lower base with a higher turbo speed would.
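As a sanity check, the topology described above multiplies out as follows (the numbers are the ones given in this article, not an official IBM spec sheet):

```python
# Telum system topology as described above: 8-core chips, two chips per
# package, four packages (sockets) per drawer, four drawers per system.
cores_per_chip = 8
chips_per_package = 2
packages_per_drawer = 4
drawers_per_system = 4

chips_per_system = chips_per_package * packages_per_drawer * drawers_per_system
cores_per_system = chips_per_system * cores_per_chip

print(chips_per_system)  # 32 chips ("CPUs")
print(cores_per_system)  # 256 cores
```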
Bags full of cache
Doing away with the central System Controller on every package meant Telum’s cache also had to be redesigned – the huge 960 MiB L4 cache is gone, as is the per-die shared L3 cache. In Telum, each individual core has its own 32 MiB L2 cache – and that’s it. There is no hardware L3 or L4 cache at all.
This is where things get really weird – while each Telum core’s 32 MiB L2 cache is technically private, it’s only virtually private. When a line is evicted from the L2 cache of one core, the processor looks for empty space in the L2s of the other cores. If it finds some, the evicted L2 cache line from core X is re-tagged as an L3 cache line and stored in core Y’s L2.
OK, so we have a virtual, shared L3 cache of up to 256 MiB on each Telum processor, composed of the 32 MiB “private” L2 caches on each of its eight cores. From here, Telum takes things one step further: that 256 MiB shared “virtual L3” on each processor can in turn be used as a shared “virtual L4” between all processors in a system.
Telum’s “virtual L4” works much the same way its “virtual L3” does in the first place: displaced L3 cache lines from one processor find a home on another processor. If another processor in the same Telum system has free space, the evicted L3 cache line is re-tagged as L4 and lives instead in the virtual L3 on that other processor (which, again, consists of the “private” L2s of its eight cores).
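The two-level re-tagging scheme can be sketched as a toy model. To be clear, this is an illustration of the policy as described above, not IBM’s actual implementation – real hardware tracks lines per cache set with far more state, and every name here (`Chip`, `evict_from_l2`, the tiny capacities) is invented for the demo:

```python
# Toy model of Telum's virtual cache hierarchy: a line evicted from one core's
# L2 may be re-tagged and parked in a sibling core's L2 (virtual L3), and a
# line that can't fit anywhere on its own chip may be parked on another chip
# (virtual L4). Purely illustrative -- not IBM's implementation.

CORES_PER_CHIP = 8
L2_SLOTS = 4  # tiny capacity so evictions actually happen in the demo

class Chip:
    def __init__(self, chip_id):
        self.chip_id = chip_id
        # each core's "L2" is a list of (tag, level) entries
        self.l2 = [[] for _ in range(CORES_PER_CHIP)]

    def free_core(self, exclude):
        """Return the index of a core with spare L2 capacity, else None."""
        for i, lines in enumerate(self.l2):
            if i != exclude and len(lines) < L2_SLOTS:
                return i
        return None

def evict_from_l2(system, chip_idx, core_idx):
    """Evict the oldest line from a core's L2, demoting it down the hierarchy."""
    chip = system[chip_idx]
    tag, _level = chip.l2[core_idx].pop(0)

    # Step 1: try to park it in a sibling core's L2 as a "virtual L3" line.
    sibling = chip.free_core(exclude=core_idx)
    if sibling is not None:
        chip.l2[sibling].append((tag, "L3"))
        return ("L3", chip_idx, sibling)

    # Step 2: no room on this chip -- try another chip's virtual L3,
    # re-tagging the line as "virtual L4".
    for other in system:
        if other is chip:
            continue
        host = other.free_core(exclude=-1)
        if host is not None:
            other.l2[host].append((tag, "L4"))
            return ("L4", other.chip_id, host)

    return ("dropped", None, None)  # nowhere to park it: back to DRAM

# Demo: fill core 0 of chip 0, then force an eviction.
system = [Chip(i) for i in range(2)]
system[0].l2[0] = [(f"line{i}", "L2") for i in range(L2_SLOTS)]
print(evict_from_l2(system, 0, 0))  # line0 lands on a sibling core as "L3"
```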
AnandTech’s Ian Cutress takes a closer look at Telum’s cache mechanisms, eventually summing them up by answering “How is this possible?” with a simple “magic.”
AI inference acceleration
Telum also introduces a 6 TFLOPS on-die inference accelerator. Among other things, it’s intended to be used for real-time fraud detection during financial transactions (as opposed to shortly after the transaction).
In the quest for maximum performance and minimum latency, IBM is threading a couple of needles. The new inference accelerator is placed on-die, allowing for lower-latency connections between the accelerator and the CPU cores – but it is not built into the cores themselves, a la Intel’s AVX-512 instruction set.
The problem with in-core inference acceleration like Intel’s is that it limits the AI processing power available to any single core. A Xeon core executing AVX-512 instructions only has its own core’s hardware at its disposal, which means that larger inference jobs must be split across multiple Xeon cores to extract the full performance available.
Telum’s accelerator is on-die but off-core. This allows a single core to run inference workloads with the power of the entire on-die accelerator, not just a slice of it.
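The shape of that trade-off can be made concrete with a back-of-the-envelope comparison. The 6 TFLOPS figure comes from IBM’s announcement; the per-core SIMD number and the job size below are hypothetical placeholders chosen for illustration, not measured Xeon figures:

```python
# Illustrative comparison: one shared on-die accelerator vs per-core SIMD.
# 6 TFLOPS is from IBM's Telum announcement; the other numbers are invented
# purely to show the shape of the trade-off, not real measurements.
job_flops = 1e9               # a hypothetical 1-GFLOP inference task

shared_accel_flops = 6e12     # whole accelerator, available to a single core
per_core_simd_flops = 1.5e11  # hypothetical per-core SIMD peak

t_shared = job_flops / shared_accel_flops
t_single_core = job_flops / per_core_simd_flops

print(f"shared accelerator: {t_shared * 1e6:.0f} us")
print(f"single-core SIMD:   {t_single_core * 1e6:.0f} us")
# Closing that gap with SIMD alone means fanning the job out across many
# cores, which adds scheduling and synchronization latency of its own.
```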
Listing image by IBM