In terms of compression efficiency, I knew there were some obvious places that could use improvement. In particular, my Huffman trees…weren’t even really Huffman trees. The intent was for them to be Huffman-like in that the most frequently seen symbols would be closest to the top of the tree and thus have the shortest bitstrings, but the construction and balancing method was completely different. Whenever a symbol’s count increased, I compared it to the parent’s parent’s other child, and if the current symbol’s count was now greater, the two nodes were swapped, a new branch was inserted where the updated node used to be, and the other child was pushed down a level.
Unfortunately, that method led to horribly imbalanced trees, since it only considered nearby nodes when rebalancing, when changing the frequency of a symbol can actually affect the relationship of symbols on relatively distant parts of the tree as well. As an example, here’s what a 4-bit length tree wound up looking like with my original adaptive method:
Lengths tree:
Leaf node 0:  Count=2256  BitString=1
Leaf node 1:  Count=1731  BitString=001
Leaf node 2:  Count=1268  BitString=0001
Leaf node 3:  Count=853   BitString=00001
Leaf node 4:  Count=576   BitString=000001
Leaf node 5:  Count=405   BitString=0000001
Leaf node 6:  Count=313   BitString=00000001
Leaf node 7:  Count=215   BitString=000000000
Leaf node 8:  Count=108   BitString=0000000011
Leaf node 9:  Count=81    BitString=00000000101
Leaf node 10: Count=47    BitString=000000001001
Leaf node 11: Count=22    BitString=00000000100001
Leaf node 12: Count=28    BitString=0000000010001
Leaf node 13: Count=15    BitString=000000001000000
Leaf node 14: Count=9     BitString=000000001000001
Leaf node 15: Count=169   BitString=01
Avg bits per symbol = 3.881052
If you take the same data and manually construct a Huffman tree the proper way, you get a much more balanced tree without the ludicrously long strings:
Leaf node 0:  Count=2256  BitString=10
Leaf node 1:  Count=1731  BitString=01
Leaf node 2:  Count=1268  BitString=111
Leaf node 3:  Count=853   BitString=001
Leaf node 4:  Count=576   BitString=1100
Leaf node 5:  Count=405   BitString=0001
Leaf node 6:  Count=313   BitString=11011
Leaf node 7:  Count=215   BitString=00001
Leaf node 8:  Count=108   BitString=000001
Leaf node 9:  Count=81    BitString=000000
Leaf node 10: Count=47    BitString=1101000
Leaf node 11: Count=22    BitString=110100110
Leaf node 12: Count=28    BitString=11010010
Leaf node 13: Count=15    BitString=1101001111
Leaf node 14: Count=9     BitString=1101001110
Leaf node 15: Count=169   BitString=110101
Avg bits per symbol = 2.969368
That’s nearly a bit per symbol better, which may not sound like much, but with the original method there was barely any compression happening at all (3.88 bits per symbol against the 4 raw bits, about 3%), whereas a proper tree achieves just over 25% compression (2.97 bits against 4).
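For reference, the “Avg bits per symbol” figure is just the count-weighted average of the code lengths. A quick sketch of that calculation in Python, using the counts and bitstring lengths from the proper tree above:

```python
# Counts and code lengths taken from the proper Huffman tree listed above.
counts  = [2256, 1731, 1268, 853, 576, 405, 313, 215,
           108, 81, 47, 22, 28, 15, 9, 169]
lengths = [2, 2, 3, 3, 4, 4, 5, 5,
           6, 6, 7, 9, 8, 10, 10, 6]   # len() of each BitString

total_symbols = sum(counts)
total_bits = sum(c * n for c, n in zip(counts, lengths))
print(total_bits / total_symbols)              # ~2.969368 bits per symbol

# Against the 4 raw bits per length value:
print(1 - (total_bits / total_symbols) / 4)    # ~0.258, i.e. just over 25%
```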
So, I simply dumped my original adaptive method and made it construct a Huffman tree in the more traditional way, repeatedly pairing the two lowest-count nodes in a sorted list. To keep it adaptive, it still does the count check against the parent’s parent’s other child, and when that threshold is crossed it simply rebuilds the entire Huffman tree from scratch based on the current symbol counts. This involves a lot more CPU work, but as we’ll see later, performance bottlenecks aren’t necessarily where you think they are…
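For illustration only (this isn’t the project’s actual code), here’s a minimal Python sketch of that traditional construction: every symbol goes into a priority queue keyed on its count, the two lowest-count nodes are repeatedly merged until a single tree remains, and the bitstrings then fall out of a tree walk:

```python
import heapq

def build_huffman_code(counts):
    """Return {symbol: bitstring} built from a list of symbol counts."""
    # One heap entry per symbol: (count, tie_breaker, tree), where tree is
    # either a symbol index (leaf) or a (left, right) pair (internal node).
    heap = [(count, sym, sym) for sym, count in enumerate(counts)]
    heapq.heapify(heap)

    # Classic construction: repeatedly merge the two lowest-count nodes.
    tie = len(counts)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tie, (left, right)))
        tie += 1

    # Walk the finished tree to assign a bitstring to each leaf.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # degenerate one-symbol alphabet
    walk(heap[0][2], "")
    return codes
```

The adaptive part then amounts to keeping running counts and, whenever the parent’s-parent check trips, throwing the old tree away and rebuilding from the current counts with something like the above.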
My trees also differ from traditional ones in that they are prepopulated with every possible symbol at a count of zero, whereas usually you only insert nodes into a Huffman tree if they have a count greater than zero. This is slightly suboptimal, but it avoids a chicken-and-egg problem: the decoder can’t know which symbol a bitstring corresponds to if the symbol isn’t in the tree yet because this is the first time it has been seen.
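In sketch form, the idea is simply that both sides start from identical all-zero counts and therefore derive identical trees, even for symbols neither has seen yet (reusing the build_huffman_code helper from above; the 16-symbol lengths alphabet is just an example):

```python
NUM_LENGTH_SYMBOLS = 16          # e.g. the 4-bit lengths alphabet above

# Encoder and decoder both start from the same all-zero counts, so they
# build identical trees and every symbol already has a bitstring.
counts = [0] * NUM_LENGTH_SYMBOLS
codes = build_huffman_code(counts)

# After each symbol is encoded/decoded, both sides bump its count; here the
# tree is rebuilt immediately for illustration, whereas the real code only
# rebuilds when the parent's-parent threshold check trips.
counts[5] += 1
codes = build_huffman_code(counts)
```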
Knowing that, and with the improved Huffman trees, another thing became clear: using Huffman trees for the offsets wasn’t really doing much good at all. With most files the offset values are too evenly distributed, many are never used at all, and all of those zero-count entries get pushed down the tree into longer strings, so the first time an offset was used its bitstring was often longer than its plain bit length, causing file growth instead of compression. I instead just ripped those trees out and emitted plain old integer values for the offsets.
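A rough sketch of the difference (the BitWriter here is a stand-in invented for illustration, and the 16-bit offset width is an assumption, not necessarily what the format uses):

```python
OFFSET_BITS = 16   # assumed offset width for illustration

class BitWriter:
    """Tiny MSB-first bit accumulator, just for this example."""
    def __init__(self):
        self.bits = []
    def write_bitstring(self, s):
        self.bits.extend(int(ch) for ch in s)
    def write_int(self, value, nbits):
        self.write_bitstring(format(value, "0{}b".format(nbits)))

def emit_match(out, length_codes, length_sym, offset):
    # The match length still goes through its Huffman tree...
    out.write_bitstring(length_codes[length_sym])
    # ...but the offset is now emitted as a plain fixed-width integer
    # instead of being looked up in an offsets tree.
    out.write_int(offset, OFFSET_BITS)
```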
The way I was constructing my trees also had another limitation: the total number of symbols had to be a power of two. With the proper construction method, an arbitrary number of symbols can be specified, and that allowed another potential optimization: merging the identifier tree and the literals tree. The identifier token in the output stream guaranteed that there would always be at least one wasted non-data bit per token, and often two. Merging it with the literals would increase the size of literal symbols, but the expectation is that the larger literal symbols would on average still be smaller than an identifier symbol plus a smaller literal symbol, especially as more ‘special case’ symbols are added. Instead of reading an identifier symbol and deciding what to do based on that, the decoder would read a ‘literal’ symbol; if it was in the range 0–255, it really was a literal byte value and was interpreted that way, but if it was 256 or above, it would be treated as having a following offset/length pair.
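A minimal sketch of that decoder branch (the read_* callables are placeholders for the bitstream readers, and the exact length/offset semantics are assumptions rather than the actual format):

```python
FIRST_MATCH_SYMBOL = 256   # 0-255 are literal bytes; 256+ mean "match follows"

def decode_token(read_symbol, read_length, read_offset, out):
    """Decode one token from the merged literal/identifier alphabet.

    read_symbol/read_length/read_offset stand in for the bitstream readers;
    out is the growing output buffer (a bytearray).
    """
    sym = read_symbol()
    if sym < FIRST_MATCH_SYMBOL:
        # Plain literal byte value.
        out.append(sym)
    else:
        # Symbols 256 and up mean an offset/length pair follows (and leave
        # room for more 'special case' symbols later).
        length = read_length()
        offset = read_offset()
        start = len(out) - offset
        for i in range(length):   # byte-by-byte so overlapping copies work
            out.append(out[start + i])
```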
The range of offsets to handle would also have to change, but that’s for next time… With the Huffman tree improvements, my 195kB test file that previously compressed to 87.4kB now compressed to 84.2kB. Still not as good as gzip, but getting there.