LZW is a simple compression algorithm – certainly the simplest, most deterministic I could find in my investigations. And here I use ‘deterministic’ in the sense that there are no implementation-dependent decisions, ambiguities, smarts or heuristics. LZW has only one real decision to make: dictionary size. And for my application, which requires highly deterministic compression, I’m willing to fix the dictionary size.
But LZW has a weakness: after the dictionary is filled, no new patterns are recognized. And the dictionary, at that point, is very poorly utilized: it is full of entries that are never matched again.
There are existing approaches to mitigate this weakness, e.g. including a ‘clear code’ to reset the dictionary, or compressing one chunk at a time. However, these tend to require heuristic decisions that undermine LZW’s wonderful determinism. They also tend to discard old but still-valid patterns.
In this article, I propose a simple alternative: garbage collect the dictionary with an exponential decay factor.
A preliminary (but flawed) approach:
- To every element in the extended dictionary, add a match count.
- Increment match count on match (even when we don’t output the token).
- When dictionary is full, output current match token & garbage collect.
- To garbage collect (GC), remove zero-match elements and cut the remaining match counts in half.
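The steps above can be sketched in a few lines of Python. This is my own illustration, not the article’s actual implementation; the names `BASE_SIZE`, the entry layout, and the assumption that literal byte tokens are never collected are all assumptions of the sketch.

```python
BASE_SIZE = 256  # assumed: literal byte tokens, never collected

def gc(entries):
    """entries: dict mapping token -> {'pattern': bytes, 'count': int}.
    Remove zero-match extended tokens; halve surviving match counts."""
    survivors = {}
    for tok, e in entries.items():
        if tok < BASE_SIZE or e['count'] > 0:
            e['count'] //= 2  # exponential decay: favor newer patterns
            survivors[tok] = e
    return survivors
```

After each GC pass, the freed token numbers are available for new patterns, so the dictionary can keep adapting even after it first fills up.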
LZW is quite optimistic when growing its dictionary. In practice, we can expect that many entries go unmatched, and thus will be available for GC. LZW attempts to grow its dictionary after every output token, so we should expect to perform a GC roughly once every dictionary-size outputs. The exponential decay factor results from cutting match counts in half at each GC, which favors newer patterns over older ones.
The flaw of this approach is that not all patterns have equal opportunity to find matches. Information about patterns discovered just before GC is more likely to be lost. This flaw can be mitigated by incremental GC. We can GC exactly one token just before every allocation. In a round-robin fashion, we can halve the match count of tokens until we find one with zero count. That’s the one token we collect.
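The round-robin scan can be sketched as follows; this is a hedged illustration under my own assumptions (a mutable list of counts for the extended tokens, with a scan cursor carried between calls), not the article’s code.

```python
def collect_one(counts, cursor):
    """counts: mutable list of match counts for the extended tokens.
    Scan round-robin from `cursor`, halving counts as we pass, until a
    zero-count slot is found; return (freed_slot, next_cursor)."""
    n = len(counts)
    while True:
        i = cursor % n
        cursor += 1
        if counts[i] == 0:
            return i, cursor % n  # this slot is collected and reused
        counts[i] //= 2           # exponential decay while scanning
```

Termination is guaranteed: repeated halving drives every count to zero, so the scan always finds a free slot eventually, and on average it does very little work per allocation.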
This can potentially result in unusual cases where we ‘match’ patterns that never existed in the original input, i.e. because the dictionary has some intermediate nodes that were collected. This is unlikely but okay; it doesn’t hurt anyone. The dictionary doesn’t change between outputs, and we’ll always be able to decode any token the way the compressor meant it.
To decode, for each input token we recursively increase the match count for that token and all its dependencies. We also perform a round-robin GC after every input, and keep track of the free token. None of this is very sophisticated: we’re simply reproducing the same dictionary the compressor used to generate the output.
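The decoder’s bookkeeping might look like this. It assumes each extended token records the token it extends (its prefix), which is how a standard LZW dictionary is built; the names `prefix_of` and `credit` are illustrative.

```python
def credit(token, prefix_of, counts):
    """prefix_of: dict mapping an extended token to the token it extends.
    counts: dict token -> match count. Bump the count of `token` and of
    every token along its prefix chain (its dependencies)."""
    while True:
        counts[token] = counts.get(token, 0) + 1
        if token not in prefix_of:
            return            # reached a literal token; chain ends
        token = prefix_of[token]
```

Because the decoder credits exactly the tokens the encoder matched, both sides run the same GC on the same counts and their dictionaries stay in lockstep.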
This extended form of LZW, which I’ll just call LZW-GC, is in some ways very similar to sliding-window compression algorithms. But I think the dictionary is simpler, especially on the encode side. Theoretically, LZW-GC should be superior to a sliding window of equivalent size, because LZW-GC can make effective use of older patterns that are long gone from the window. To complete LZW-GC, we should follow it with a Huffman encoding to compress frequent tokens; this is independent of the GC at the LZW layer.
Where’s the Code? The code is available on github, along with compression results.
It turns out that LZW-GC is about the same as LZW in most use cases, but significantly better for a few very, very large inputs. I’ll need to explore some ways to speed up dictionary construction, such as adapting LZAP (a variant of LZW).