This Monday, Linux kernel creator Linus Torvalds ranted in frustration about the lack of error-correcting code (ECC) RAM in consumer PCs and laptops.
… the misguided and backward policy of “consumers don’t need ECC” [made] the market for ECC memory go away.
The arguments against ECC have always been complete nonsense. Now even the memory manufacturers are starting to do ECC internally because they finally recognized that it is an absolute must.
If you’re not familiar with ECC RAM, it’s probably because you don’t build or specify dedicated servers with server-grade CPUs and motherboards – which, unfortunately, is about the only place you’ll actually find ECC. In a nutshell, ECC RAM contains a small amount of additional memory that is used for error detection and correction.
Memory Errors and Probability
In most modern implementations, this means that for every 64-bit word stored in RAM, there are eight check bits. A single-bit error (a 0 that flips to 1, or a 1 that flips to 0) can be both automatically detected and corrected. Two bits flipped in the same word can be detected but not corrected. Three or more bits flipped in the same word will probably be detected, but detection is not guaranteed.
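The single-correct, double-detect behavior described above can be illustrated with a textbook extended Hamming code over 64 data bits and eight check bits. This is only a sketch: real memory controllers use their own (72,64) codes, and the bit layout and the function names `encode` and `decode` here are assumptions made for illustration.

```python
# Illustrative (72,64) SECDED code: a Hamming code over positions 1..71
# plus one overall parity bit at position 0. Not any vendor's actual layout.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]          # 7 Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]  # 64 slots


def _syndrome(bits):
    """XOR of the positions of all set bits in 1..71 (0 for a valid codeword)."""
    s = 0
    for pos in range(1, 72):
        if bits[pos]:
            s ^= pos
    return s


def encode(word):
    """Return a 72-element list of bits encoding a 64-bit integer."""
    bits = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    s = _syndrome(bits)
    for p in PARITY_POSITIONS:       # set check bits so the syndrome becomes 0
        if s & p:
            bits[p] = 1
    bits[0] = sum(bits) & 1          # overall parity: total number of 1s is even
    return bits


def decode(bits):
    """Return (word, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = list(bits)
    s = _syndrome(bits)
    odd = sum(bits) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        bits[s] ^= 1                 # s == 0 means the overall parity bit flipped
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome (two flips), or a syndrome
        # that points outside the codeword: detected but not fixable.
        return None, "uncorrectable"
    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status
```

Flipping any one of the 72 bits of `encode(w)` and passing the result to `decode` returns `w` with status `corrected`; flipping any two returns `uncorrectable`. Three flips look like an odd-parity error, so this sketch may either flag them or silently "correct" the wrong bit, which is why detection beyond two bits is not guaranteed.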
Bitflips can happen for many reasons, ranging from cosmic rays to simple hardware failure. A large-scale survey of Google servers found that approximately 32 percent of all servers (and 8 percent of all DIMMs) in Google’s fleet experience at least one memory error per year. But the vast majority of these are single-bit errors – and since Google uses server CPUs and ECC RAM, that means the machines in question just keep running.
In consumer machines, even these single-bit errors – which are more than 40 times more common than multi-bit errors, according to Google’s data – go undetected and can lead to system instability and data corruption.
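To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the 40-to-1 ratio is exact (the data says "more than") and that each year is independent with the same 8 percent per-DIMM rate, which is a simplification not taken from the study; the function names are hypothetical.

```python
def correctable_share(single_to_multi_ratio):
    """Fraction of errors that are single-bit, and so fixable by ECC."""
    return single_to_multi_ratio / (single_to_multi_ratio + 1)


def p_at_least_one(annual_rate, years):
    """Chance of one or more errors over several years.

    Assumes independent years with a constant rate (illustrative only).
    """
    return 1 - (1 - annual_rate) ** years


share = correctable_share(40)          # 40 / 41
five_year = p_at_least_one(0.08, 5)    # one DIMM, five years
print(f"single-bit share of errors: {share:.1%}")
print(f"DIMM sees an error within 5 years: {five_year:.1%}")
```

Under these assumptions, well over nine in ten memory errors are the kind ECC silently corrects, and a single DIMM has roughly a one-in-three chance of hitting at least one error over five years.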
Bitflips aren’t always accidental
Not every RAM error is the result of a hardware failure or unintended electromagnetic interference. In recent years, researchers have developed increasingly practical physical-level side-channel attacks, which use controlled, rapid bit flips in areas of RAM that one application can access to either infer or outright change the values of data in adjacent RAM areas that it shouldn’t be able to reach.
While ECC RAM can’t mitigate RAMBleed-like attacks that infer the values of adjacent memory, it can generally stop Rowhammer attacks – where rapidly flipping bits in one area of RAM causes bits in an adjacent area to change.
Even when ECC cannot actively prevent a Rowhammer attack from having an impact on the system, e.g. when it flips several bits in one word, it can at least notify the system of the problem and in most cases prevent the Rowhammer attack from doing anything other than causing downtime. (Most ECC systems are configured to stop the entire machine if an uncorrectable error is detected.)
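On Linux, that notification typically surfaces through the kernel’s EDAC subsystem, which exposes per-memory-controller counters in sysfs when a suitable driver is loaded. The sketch below reads those counters; the helper name `read_edac_counts` is an assumption for illustration, and on machines without ECC or without an EDAC driver it simply returns an empty dictionary.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {'corrected': n, 'uncorrected': n}} from EDAC sysfs."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts                      # no EDAC support visible
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue                       # skip unreadable or malformed entries
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(name, c)
```

A steadily rising corrected-error count on one controller is the kind of early warning a non-ECC machine never gets.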
Torvalds blames Intel
And the memory manufacturers claim it’s because of economics and lower power consumption. And they are lying bastards – let me once again point to row-hammer and how those problems have been going on for generations, but these f*ckers were happy to sell broken hardware to consumers, claiming it was an “attack” when it always was “we cut corners.”
How many times has a row-hammer-like bit flip happened through sheer bad luck on real non-attack loads? We will never know. Because Intel was pushing nonsense to consumers.
Torvalds takes the bold stance that the lack of ECC RAM in consumer technology is Intel’s fault because of the company’s policy of artificial market segmentation. Intel has a vested interest in pushing companies with deeper pockets to its more expensive and profitable server-grade CPUs, rather than letting those entities effectively use the lower-margin consumer parts.
Removing support for ECC RAM from CPUs not directly targeting the server world is one of the ways Intel has kept those markets highly segmented. Torvalds’ argument here is that Intel’s refusal to support ECC RAM in its consumer-facing parts — along with its de facto near-monopoly in that space — is the real reason ECC is nearly unavailable outside of the server space.
The usual argument for why ECC is not present in consumer technology revolves around cost, but we suspect Torvalds has the right of it. While ECC RAM is essentially a hard-to-find specialty part, it typically costs only about 20 percent more per DIMM than retail non-ECC. The real problem is that without motherboards and CPUs that support it, it won’t do you any good.