# Supplement on Parity and Error Correcting RAM

### Parity Memory vs. ECC Memory

Sometimes, when you write a byte of data to RAM and later read it back, the eight bits that come back are not all identical to those you wrote. RAM is more reliable now than a few years ago, but when you have a multi-user system, it pays to avoid system hangs and crashes.

• With plain memory, errors are undetected and corrupt the results until they propagate to the point that the application or operating system crashes, by which time bad data may well have been stored to disk.

• With parity memory, single-bit or triple-bit errors are detected immediately (crashing the application or operating system before corrupting on-disk data) but double-bit or quadruple-bit errors are not detected.

How is this done? For every eight bits of data written to RAM, the RAM subsystem hardware computes a ninth ("parity") bit and stores it along with the eight data bits. For example, if using so-called "odd parity," the ninth bit will be given the value of one if there are an even number of bits already set to a value of one in the eight data bits. Changing any one data bit will change the computed value for the parity bit.

When the RAM subsystem sends data back, it re-computes the parity bit from the eight data bits it read, and compares that with the parity bit it read. If the two agree, it proceeds. If the two disagree, then it knows that one of the nine bits is wrong, and it signals the CPU that the data are not valid. (One time out of nine, on the average, it will be the parity bit itself that is wrong, but most of the time it is one of the eight data bits that is wrong.)

```
Original Data and Computed Parity

01101100 1

there are four data bits with value 1,
so the parity is 1 to give an odd number of bits set

Recovered Data and Parity

01111100 1
   ^
one bit in the data has changed!

Re-computed Parity

0

there are five bits with value 1 in the recovered data so the
re-computed parity is 0 to leave an odd number of bits set
```

Because the re-computed parity does not agree with the recovered parity, we know that an error has occurred, but we don't know which bit changed. Depending on the system design, and on whether the byte being read was data or program code, this may crash the system or the application, or it may just produce an error message on the screen. Either way, with functioning parity memory and system software designed to notice it, there will be some notification of the error. If two bits change, however, the re-computed parity will match the recovered parity, and the bad data will be accepted with no immediate error notification, although a mysterious problem may surface later.
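The check described above can be sketched in a few lines of Python. This is only an illustration of the odd-parity arithmetic, not of how the RAM hardware is built; the function names `odd_parity_bit` and `parity_ok` are made up for this example.

```python
def odd_parity_bit(data: int) -> int:
    """Return the ninth bit that makes the total number of 1 bits odd."""
    ones = bin(data & 0xFF).count("1")
    return 1 if ones % 2 == 0 else 0


def parity_ok(data: int, stored_parity: int) -> bool:
    """Re-compute parity from the data read back and compare with the stored bit."""
    return odd_parity_bit(data) == stored_parity


written = 0b01101100
parity = odd_parity_bit(written)      # stored alongside the eight data bits

one_flip = written ^ 0b00010000       # single-bit error: 01111100
two_flips = written ^ 0b00010001      # double-bit error: 01111101

print(parity_ok(written, parity))     # True: data unchanged
print(parity_ok(one_flip, parity))    # False: error detected
print(parity_ok(two_flips, parity))   # True: error goes unnoticed
```

The last line shows the weakness of parity: two flipped bits restore the original parity, so the corrupted byte is accepted.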

• With error-correcting code (ECC) memory, as usually implemented, all single-bit errors are corrected on the fly and all double-bit errors (and many multiple-bit errors) are detected and reported (crashing the operating system or application). ECC memory achieves this by storing, for example, 7 extra ("redundancy") bits with each 32 bits of data. With the correct algorithm for calculating those redundancy bits, any one-bit change can be located, and since a bit has only two possible values, knowing which bit changed tells you what it used to be, so the electronics can correct it. For 32 data bits, six redundancy bits are enough to locate any single-bit error; the seventh lets the hardware tell a double-bit error apart from a single-bit one, so that double-bit errors are detected but not corrected. (Correcting double-bit errors would require more redundancy than 39 bits to store 32 bits' worth of data.)
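The figure of 7 redundancy bits for 32 data bits can be checked with a short calculation. The sketch below assumes a Hamming code extended with one overall parity bit, which is one standard construction with these parameters; the text does not say which code a given memory system actually uses, and the function names are invented for illustration.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1.

    r check bits give 2**r syndromes: one for "no error" plus one for
    each of the data_bits + r positions where a single error could be.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Single-error correction plus one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


for width in (8, 32, 64):
    print(width, secded_check_bits(width))
# 8 5
# 32 7
# 64 8
```

Under this assumption, wider words are cheaper to protect: the overhead is 7 bits per 32 but only 8 bits per 64.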

If the probability of an error in any one bit during any one memory cycle is one in a hundred quadrillion (1 in 10^17), and if the memory system is running at 10 MHz (a 100-nanosecond cycle), and if you have 125 megabytes of RAM (1 billion bits), then you would expect on average to see one single-bit error every ten seconds and one double-bit error every thousand quadrillion (10^18) seconds, which is somewhat more than the age of the universe. That is why ECC memory is worth using, and why it is designed to detect but not correct double-bit errors.
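The arithmetic behind these figures can be reproduced as follows. The single-bit estimate follows directly from the numbers in the paragraph, read as a per-bit, per-cycle probability. The double-bit estimate uses a rough model, chosen here because it reproduces the quoted figure: one particular companion bit must fail in the same cycle as the first. Treat that model as an assumption rather than the author's stated derivation.

```python
from fractions import Fraction

# Figures taken from the paragraph above.
p_bit_per_cycle = Fraction(1, 10**17)   # one in a hundred quadrillion
cycles_per_second = 10**7               # 10 MHz
total_bits = 125_000_000 * 8            # 125 megabytes = 1 billion bits

# Expected single-bit errors per second across the whole memory.
single_rate = p_bit_per_cycle * cycles_per_second * total_bits
seconds_between_single = 1 / single_rate

# Assumed rough model: a second, specific bit fails in the same cycle.
double_rate = single_rate * p_bit_per_cycle
seconds_between_double = 1 / double_rate

print(seconds_between_single)   # 10
print(seconds_between_double)   # 1000000000000000000
```

`Fraction` is used instead of floating point so that the powers of ten come out exact.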

The above calculation of the probability of double-bit errors is optimistic, in that it assumes that the errors are all "statistically independent," that is, that there will not be single events that cause simultaneous multiple-bit errors. For example, a failure of the tiny wires (inside the integrated circuit chip's carrier) that connect the DC power from the circuit board to the chip itself will cause all of the bits stored on that chip to fail. By allocating the various bits of each byte to different chips, it is possible to reduce the vulnerability of the RAM to such errors.

Modern RAM chips store each bit as a small electric charge (or the absence of a small electric charge). Ionizing radiation resulting from cosmic rays or the radioactive decay of trace contaminants of the chip or its surrounding carrier can alter the stored value. High voltages, whether resulting from static electricity during improper handling or from transient events such as lightning strikes nearby, can also damage integrated circuits, either permanently or temporarily.

### Error Correction Codes

Among the codes that can detect and correct single-bit errors, and can also detect double-bit errors without confusing them with single-bit errors, the simplest to describe is to repeat each bit four times. For example:

```
Data:      0110

Encoded:   0000111111110000
```
If any one bit changes, there is no question as to the original value, so it is possible to report the correct value for each bit automatically. If, on the other hand, two bits change within the same group of four, then you cannot tell by inspection which two have changed, and so you cannot tell what the correct value is, but you can tell that something is wrong.
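A minimal sketch of this four-fold repetition code, working on strings of "0" and "1" for readability. The names `encode`, `decode`, and `flip` are illustrative only.

```python
def encode(bits: str) -> str:
    """Repeat every data bit four times."""
    return "".join(b * 4 for b in bits)


def decode(encoded: str):
    """Return (data, status); data is None when a group cannot be resolved."""
    out = []
    status = "ok"
    for i in range(0, len(encoded), 4):
        ones = encoded[i:i + 4].count("1")
        if ones == 2:
            # Two of four bits disagree: we know something is wrong,
            # but not which value was written.
            return None, "uncorrectable"
        if ones in (1, 3):
            status = "corrected"      # one bit outvoted by the other three
        out.append("1" if ones >= 3 else "0")
    return "".join(out), status


def flip(bits: str, index: int) -> str:
    """Invert the bit at the given index."""
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]


stored = encode("0110")
print(stored)                         # 0000111111110000

one_error = flip(stored, 1)
two_errors = flip(one_error, 2)       # second error in the same group

print(decode(stored))                 # ('0110', 'ok')
print(decode(one_error))              # ('0110', 'corrected')
print(decode(two_errors))             # (None, 'uncorrectable')
```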

This is a very expensive coding, requiring four times as much physical RAM as the data itself. Sophisticated mathematical analysis demonstrates that much cheaper approaches are possible. Real ECC memory uses a much less expensive encoding, using 39 bits to encode 32, to provide just enough redundancy to detect double-bit errors and to correct single-bit errors.
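To make the 39-bit figure concrete, here is a sketch of one code with exactly these parameters: a Hamming code over positions 1 to 38, with check bits at the power-of-two positions, plus one overall parity bit at index 0. The text does not say which code real ECC memory uses, so the choice of code, the bit layout, and all names below are assumptions for illustration.

```python
DATA_BITS = 32
CHECK_BITS = 6                          # at positions 1, 2, 4, 8, 16, 32
HAMMING_LEN = DATA_BITS + CHECK_BITS    # positions 1..38; index 0 is overall parity


def is_data_position(pos: int) -> bool:
    """Data bits occupy the positions that are not powers of two."""
    return pos & (pos - 1) != 0


def encode(data: int) -> list:
    """Return a list of 39 bits: overall parity, then Hamming positions 1..38."""
    word = [0] * (HAMMING_LEN + 1)
    d = 0
    for pos in range(1, HAMMING_LEN + 1):
        if is_data_position(pos):
            word[pos] = (data >> d) & 1
            d += 1
    for k in range(CHECK_BITS):
        p = 1 << k
        parity = 0
        for pos in range(1, HAMMING_LEN + 1):
            if pos & p:
                parity ^= word[pos]
        word[p] = parity                # makes each checked group even
    word[0] = sum(word[1:]) % 2         # makes the whole word even
    return word


def decode(word: list):
    """Return (data, status); data is None for a detected uncorrectable error."""
    word = list(word)
    syndrome = 0
    for pos in range(1, HAMMING_LEN + 1):
        if word[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    overall_odd = sum(word) % 2 == 1
    status = "ok"
    if overall_odd:
        if syndrome > HAMMING_LEN:
            return None, "uncorrectable"
        word[syndrome] ^= 1             # syndrome 0 means the parity bit itself
        status = "corrected"
    elif syndrome != 0:
        return None, "uncorrectable"    # even number of flips: double-bit error
    data, d = 0, 0
    for pos in range(1, HAMMING_LEN + 1):
        if is_data_position(pos):
            data |= word[pos] << d
            d += 1
    return data, status


original = 0xDEADBEEF
stored = encode(original)
print(len(stored))                      # 39

one_error = list(stored)
one_error[17] ^= 1
two_errors = list(one_error)
two_errors[5] ^= 1

print(decode(stored) == (original, "ok"))             # True
print(decode(one_error) == (original, "corrected"))   # True
print(decode(two_errors))                             # (None, 'uncorrectable')
```

The syndrome is the XOR of the positions that hold a 1. It is zero for a valid word and equals the position of the flipped bit after a single error, which is what lets the decoder correct it. The overall parity bit distinguishes an odd number of flips, treated as one and corrected, from an even number, reported as uncorrectable.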