Huffman coding · Thought Toys

If some letters show up far more than others, it's wasteful to spend the same number of bits on each. Give the common ones short codes and the rare ones long codes, and the message shrinks. Huffman's trick finds the best possible such code — and there's a hard floor it can approach but never cross.

What you're seeing

Eight letters, A–H, each needing a code of 0s and 1s. Read a letter's code off the tree: start at the top and follow the branches down to it — left is a 0, right is a 1. Because every letter is a leaf (nothing hangs below it), no code is the start of another, so a stream of bits splits back into letters with no commas needed. That's a prefix code.

Huffman builds the tree from the bottom up by a stunningly simple rule: repeatedly join the two least-common things into one. Do that until everything is joined, and the letters you use most end up near the top with the shortest codes, while the rare ones sink to the bottom with long ones. Drag the dial: when every letter is equally likely there's nothing to exploit — the best code is just a flat 3 bits each, no better than numbering them. Skew the frequencies and the common letters' codes collapse to one or two bits, and the average code length drops. (Real text is wildly skewed like this — in English an e shows up roughly a hundred times more often than a z — which is exactly why squeezing it works.)

The amber bar is that average, in bits per letter. It can slide left as you skew — but watch the dashed line. That's the entropy floor: the true amount of information the source carries. The bar can crowd right up to it and never cross. Push the skew to the extreme and a strange thing shows: the floor drops below 1 bit, yet the bar sticks near 1 — because a whole-bit code can't spend half a bit on a letter. That gap is exactly why arithmetic coding exists, and why you can't compress already-random data: there's no floor left to fall to.

The rule, exactly. Each letter i with probability p_i gets a code of length ℓ_i; the average is L = Σ p_i ℓ_i, bounded by H ≤ L < H+1, H = −Σ p_i log₂ p_i where H is the Shannon entropy — the floor. Verified in node (improve/verify/25-huffman.js): the codes are always a valid prefix set (Kraft sum Σ2^−ℓ = 1 exactly across 20,000 random distributions); L always sits in [H, H+1); a dyadic source like ½, ¼, ⅛, ⅛ hits the floor exactly (L = H = 1.75); and at full skew L ≈ 1.12 bits beats a flat 3-bit code by 63%. Counter-example: hand the most frequent letters the longest codes and the average balloons (6.96 vs Huffman's 1.21) — matching codes to frequency is the whole trick; and trying to give all eight letters 2-bit codes is simply impossible (Kraft sum 2 > 1), so the floor can't be cheated.

← the cabinet · Thought Toys — a cabinet of explorable explanations. Exhibit 25.