Home Kate Morley on Mastodon

Compressing the Wordle dictionary

The popular word-guessing game Wordle is implemented using a 173KB JavaScript bundle, reduced to a 58KB transfer using Brotli compression. By efficiently compressing the dictionary, we can reduce the JavaScript bundle size by 36% to 111KB and the compressed transfer size by 26% to 43KB.

The dictionary

The Wordle JavaScript bundle contains a 10,657-word dictionary listing all of the five-letter words that Wordle will accept, accounting for 48% of the bundle size (83KB). The dictionary is stored as a simple JavaScript array:

1
2
3
4
5
6
7
8
9
Ta = [
  "aahed",
  "aalii",
  "aargh",
  // 10,651 more words...
  "zygon",
  "zymes",
  "zymic"
]

Text compression

The Brotli compression algorithm, like its predecessor gzip, can compress arbitrary text. The order of words matters when compressing most text, but in the case of an alphabetical list this information isn’t needed, as the words could have been stored in any order and later sorted into alphabetical order.

We can calculate how much information is required to represent the order of words. For a list of 10,657 unique words there are 10,657 × 10,656 × … × 3 × 2 × 1 possible permutations. Taking the binary logarithm of this number tells us that 127,219 bits are required to represent a permutation. This means we could potentially save over 15KB by only needing to represent one of the possible permutations.

Word differences

Rather than storing the words themselves, we can instead store the first word followed by the list of differences between consecutive words. This requires much less information to be stored if the words are arranged to reduce the differences between neighbouring words.

Arranging the words alphabetically is an effective way of reducing these differences, as neighbouring words usually share a common prefix — for example, cared, carer, cares, and caret differ only in the last letter.

These are the frequencies of each type of difference in the Wordle dictionary (where ⬜ represents a shared letter and ❌ represents a different letter):

Difference Frequency
⬜ ⬜ ⬜ ⬜ ❌ 2946
⬜ ⬜ ⬜ ❌ ⬜ 1022
⬜ ⬜ ⬜ ❌ ❌ 4040
⬜ ⬜ ❌ ⬜ ⬜ 43
⬜ ⬜ ❌ ⬜ ❌ 106
⬜ ⬜ ❌ ❌ ⬜ 358
⬜ ⬜ ❌ ❌ ❌ 1832
⬜ ❌ ⬜ ⬜ ❌ 1
⬜ ❌ ⬜ ❌ ⬜ 1
⬜ ❌ ⬜ ❌ ❌ 6
⬜ ❌ ❌ ⬜ ⬜ 1
⬜ ❌ ❌ ⬜ ❌ 14
⬜ ❌ ❌ ❌ ⬜ 43
⬜ ❌ ❌ ❌ ❌ 218
❌ ❌ ⬜ ❌ ⬜ 2
❌ ❌ ❌ ⬜ ❌ 1
❌ ❌ ❌ ❌ ⬜ 4
❌ ❌ ❌ ❌ ❌ 18

From this table, we can see that 85% of the time a difference in one letter means all of the following letters differ as well. This isn’t surprising, as the first difference between two words determines their alphabetical order and the following letters will only match by chance.

Encoding suffixes

If we assume that a difference in one letter means all of the following letters differ, we only need to store the change to the suffix. We don’t need to store the position of the letters as this is determined by the length of the suffix — for example, a three-letter suffix will replace the final three letters of the preceding word. We gain more by not storing the positions than we lose due to the suffix occasionally including letters than didn’t differ.

These are the changed suffixes for the ten words following the first word (aahed) in the Wordle dictionary:

Word Suffix
aalii lii
aargh rgh
aarti ti
abaca baca
abaci i
abacs s
abaft ft
abaka ka
abamp mp
aband nd

When storing the suffixes we need to be able to tell where one suffix ends and another begins. A simple way to do this is just to capitalise the first letter of each suffix. For the ten words above, this results in the string LiiRghTiBacaISFtKaMpNd. This suffix string is 56% shorter than the total length of the original words — 22 characters in comparison to 50 characters.

Decoding suffixes

To decode the suffix string back into the original word list, we initialise the word list to the first word, split the suffix string before capital letters, and then loop over the suffixes, adding each reconstructed word to the end of the word list:

1
2
3
4
5
6
7
8
Ta=['aahed']

for (let s of 'LiiRgh…OnMesIc'.split(/(?=[A-Z])/)) {
  Ta.push(
    Ta[Ta.length - 1].substring(0, 5 - s.length)
    + s.toLowerCase()
  )
}

The full version of this code can be found in suffix-decoder.js (21,460 bytes).

Size comparison

Encoding the dictionary as suffixes is so efficient that even without Brotli compression it’s smaller than the Brotli-compressed version of the original dictionary:

Dictionary Original Improved Saving
Uncompressed 83KB 21KB 75%
Compressed 28KB 14KB 50%

We still see gains from Brotli-compressing the suffix-encoded dictionary because Brotli produces an efficient binary encoding.

Because the dictionary forms such a large fraction of the JavaScript bundle, the improvements are still significant once the suffix-encoded dictionary is substituted into the bundle:

Full bundle Original Improved Saving
Uncompressed 173KB 111KB 36%
Compressed 58KB 43KB 26%

Significantly, the 15KB reduction in the compressed size matches the potential gains expected from ignoring the order of words.