While sorting through some old files of mine, I happened upon the source code to a compression engine I’d written 18 years ago. It was one of the first things I’d ever written in C++, aside from some university coursework, and I worked on it in the evenings during the summer I was in Cold Lake on a work term, just for fun. Yes, I am truly a nerd, but there wasn’t really much else to do in a tiny town like that, especially when you only get 3 TV channels.
Looking at it now, it’s kind of embarrassing, since of course it’s riddled with inexperience: no comments at all (leaving me mystified at what some of the code was even doing in the first place), unnecessary global variables, little error checking, poor header/module separation, unnecessary exposure of class internals, poor const correctness, and so on. It kind of irks my pride to leave something in such a poor state, though, so I quickly resolved to at least clean it up a bit.
Of course, I have to understand it first, and I started to remember more about it as I looked over the code. It’s a fairly basic combination of LZ77 pattern matching and Huffman coding, like the ubiquitous Zip method, but the twist I wanted to explore was making the Huffman trees adaptive, so that the symbols would shift around the tree to automatically adjust as their frequencies changed within the input stream. Two parameters controlled compression efficiency: the history buffer size, which set how far back it would look for matching patterns, and the maximum pattern length, which capped how long a match could be.
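I don’t have the tree code in a presentable state, but the adaptive trick is essentially the classic FGK-style one: keep the nodes in an array ordered by weight (the “sibling property”), and after coding a symbol, swap its node up past any equal-weight nodes before incrementing, repeating up to the root. Something like this sketch, where the Node layout and helper names are reconstructions from memory, not the original code:

```cpp
#include <utility>
#include <vector>

struct Node {
    int weight = 0;
    int parent = -1;         // array index of the parent, -1 at the root
    int child[2] = {-1, -1}; // array indices of children, -1 at a leaf
    int symbol = -1;         // symbol stored at a leaf
};

// Swap two tree positions, keeping parent links attached to the positions
// and re-pointing the children that moved.
static void swapNodes(std::vector<Node>& t, int a, int b) {
    std::swap(t[a], t[b]);
    std::swap(t[a].parent, t[b].parent);
    for (int n : {a, b})
        for (int c : t[n].child)
            if (c != -1) t[c].parent = n;
}

// After coding a symbol, bump its leaf weight and restore the sibling
// property (array sorted by weight) all the way up to the root.
void update(std::vector<Node>& tree, int leaf) {
    int n = leaf;
    while (n != -1) {
        // Find the highest position holding the same weight (but never the
        // node's own parent), and swap up to it before incrementing.
        int hi = n;
        while (hi + 1 < (int)tree.size()
               && tree[hi + 1].weight == tree[n].weight
               && hi + 1 != tree[n].parent)
            ++hi;
        if (hi != n) swapNodes(tree, n, hi);
        ++tree[hi].weight;
        n = tree[hi].parent;
    }
}
```

The nice property is that neither side ever transmits the tree: encoder and decoder both start from the same initial tree and apply the same updates, so they stay in sync for free.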
Compression proceeded through the input file byte by byte, looking for the longest possible exact byte match between the data ahead of the current position and the data in the history buffer just behind it. If no match was found, it would emit the current byte as a literal and move one byte ahead; if a match was found, it would emit a token with the offset and length of the match in the history buffer. To differentiate between these cases, it would first emit an ‘identifier’ token with one of four possible values: one for a literal, which would then be followed by the 8-bit value of the literal, and three for offset and length values, with three different possible bit lengths for the offset so that closer matches took fewer bits. Only matches of length 3 or longer were considered, since a two-byte match would likely cost more bits as identifier+offset+length than simply emitting the two bytes as literals. In summary, the four possible types of bit strings you’d see in the output were:
| ident 0 | 8-bit literal |
| ident 1 | 'x'-bit offset | length |
| ident 2 | 'y'-bit offset | length |
| ident 3 | 'z'-bit offset | length |
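The matching itself was nothing clever. A brute-force sketch of the main loop, roughly as described above (emitLiteral and emitMatch are hypothetical stand-ins for the Huffman-coded output routines, and the real code may have differed in the details):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-ins for the Huffman-coded output routines (hypothetical names).
void emitLiteral(uint8_t b)            { std::printf("lit   %02x\n", b); }
void emitMatch(size_t off, size_t len) { std::printf("match off=%zu len=%zu\n", off, len); }

constexpr size_t kMinMatch = 3; // shorter matches cost more than plain literals

void compress(const std::vector<uint8_t>& in,
              size_t historySize, size_t maxMatch) {
    size_t pos = 0;
    while (pos < in.size()) {
        size_t bestLen = 0, bestOff = 0;
        size_t start = pos > historySize ? pos - historySize : 0;
        size_t limit = std::min(maxMatch, in.size() - pos);
        // Brute-force scan of the history window for the longest match.
        for (size_t cand = start; cand < pos; ++cand) {
            size_t len = 0;
            while (len < limit && in[cand + len] == in[pos + len]) ++len;
            if (len > bestLen) { bestLen = len; bestOff = pos - cand; }
        }
        if (bestLen >= kMinMatch) {
            emitMatch(bestOff, bestLen); // ident 1/2/3 chosen by offset size
            pos += bestLen;
        } else {
            emitLiteral(in[pos]);        // ident 0 + 8-bit literal
            ++pos;
        }
    }
}
```

A serious match finder would use hash chains the way gzip does rather than rescanning the whole window, which probably explains a lot about how slow I remember this thing being.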
And then I used a lot of Huffman trees. Each of these values was then run through a Huffman tree to generate the actual symbol emitted to the output stream, with separate trees for the identifier tokens, the literals, the lengths, and each of the three offset types. HUFFMAN TREES EVERYWHERE! The compression parameters were also written to a header in the file, so the decoder would know what history buffer size and maximum match length to use.
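Roughly, the encoder side would have been wired up something like this (the AdaptiveTree stub and all the names here are guesses at the structure, not the original code, and the header-writing bit isn’t shown):

```cpp
#include <array>
#include <cstdint>

// Stub standing in for a full adaptive Huffman tree, just so the routing
// below is concrete; the interface is a guess, not the original.
struct AdaptiveTree {
    void encode(unsigned symbol) { (void)symbol; /* emit code, then update the tree */ }
};

struct Encoder {
    AdaptiveTree identTree;                  // the 4 identifier values
    AdaptiveTree literalTree;                // 256 literal bytes
    AdaptiveTree lengthTree;                 // match lengths
    std::array<AdaptiveTree, 3> offsetTrees; // one per offset bit-width class

    void writeLiteral(uint8_t b) {
        identTree.encode(0);                     // ident 0
        literalTree.encode(b);
    }
    void writeMatch(int offsetClass, unsigned offset, unsigned length) {
        identTree.encode(1 + offsetClass);       // ident 1, 2, or 3
        offsetTrees[offsetClass].encode(offset); // closer matches, fewer bits
        lengthTree.encode(length);
    }
};
```

Keeping the streams in separate trees means each one adapts to its own statistics, so a file full of long matches doesn’t drag down the literal coding and vice versa.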
It worked…okay… I’ve lost my original test files, but on one example text file of 195kB, my method compresses it down to 87.4kB, while ‘gzip -9’ manages 81.1kB. Not really competitive, but not too bad for a completely amateur attempt either. There’s still plenty of room for improvement, which will come…next time.
Ahh, progress! Ya know, I’d hate to have a symbol emitted in *my* output stream. Sounds uncomfortable, and possibly indecent — not that I’m terribly decent as it is… :-)