LZA Summary

John Kormylo


General purpose (text) data compression is based on two basic approaches. The first uses statistics to convert bytes (characters) into variable length codes, so that more frequently used bytes are coded with fewer bits. Huffman coding and arithmetic coding are the two standard algorithms, with arithmetic coding compressing slightly better but running slower. The second approach (pioneered by Lempel and Ziv) builds a dictionary to convert long strings of bytes into a dictionary index/length token. ZIP and LZH use a combination of both approaches (LZA stands for Lempel, Ziv & Arithmetic).
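
To make the first approach concrete, here is a minimal sketch in C (illustration only, not LZA's code) that counts byte frequencies and prints the ideal Shannon code length, -log2(p), for each byte. A Huffman or arithmetic coder approaches these lengths, which is why frequent bytes end up with short codes.

    #include <stdio.h>
    #include <math.h>

    /* Illustration only: compute the ideal code length -log2(p) for each
     * byte value appearing in a buffer. */
    void ideal_code_lengths(const unsigned char *buf, size_t n)
    {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)
            count[buf[i]]++;
        for (int b = 0; b < 256; b++) {
            if (count[b] == 0) continue;
            double p = (double)count[b] / (double)n;
            printf("byte 0x%02X: p = %.4f, ideal length = %.2f bits\n",
                   b, p, -log2(p));
        }
    }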

Sometimes a string of bytes occurs very frequently. If the dictionary is stored in a binary search tree, all copies of such a string appear in consecutive (sorted) positions. So instead of pointing to one specific string in the group, LZA points to the entire group. As it happens, arithmetic compression works by coding each symbol as a range of (index) values anyway, and a wider range simply takes fewer bits to code.
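
The following sketch shows the core update step of an integer arithmetic coder (the structure and names are mine, not LZA's actual coder). Encoding "any dictionary entry in cumulative positions [cum_lo, cum_hi) out of total" narrows the coder's interval exactly the way coding a single symbol would, so pointing at a whole group of identical strings fits naturally into the coder.

    /* Minimal sketch of an integer arithmetic coder's encode step. */
    typedef struct {
        unsigned int low;    /* current interval base  */
        unsigned int range;  /* current interval width */
    } ArithEncoder;

    void arith_encode(ArithEncoder *e, unsigned int cum_lo,
                      unsigned int cum_hi, unsigned int total)
    {
        unsigned int step = e->range / total;
        e->low   += step * cum_lo;
        e->range  = step * (cum_hi - cum_lo);
        /* renormalization (shifting out settled top bits) omitted */
    }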

A FIFO dictionary is used to store strings until their entire length (65 bytes maximum) has been decoded. Not only does this simplify tree maintenance for the main dictionary, the probability of finding a match within the FIFO buffer is very high, which improves compression. Also, the probability varies depending on how recently a string was added, and this information is used to adjust the length of the index codes.
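
A hedged sketch of the idea follows (the names and the buffer size are assumptions, not LZA's source): a ring buffer holds strings until they reach full length, and match counts indexed by buffer age feed the adaptive model, so recently added entries get shorter index codes.

    #define FIFO_SIZE 64   /* assumed size, for illustration */
    #define MAX_LEN   65

    typedef struct {
        unsigned char data[FIFO_SIZE][MAX_LEN];
        int           len[FIFO_SIZE];    /* bytes decoded so far         */
        unsigned int  hits[FIFO_SIZE];   /* match counts by age (0=newest) */
        int           head;              /* next slot to overwrite       */
    } FifoDict;

    /* Record a match at 'slot'; its age feeds the adaptive index model. */
    void fifo_match(FifoDict *f, int slot)
    {
        int age = (f->head - 1 - slot + FIFO_SIZE) % FIFO_SIZE;
        f->hits[age]++;   /* higher counts => shorter arithmetic codes */
    }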

Similarly, shorter strings are more likely than longer strings, so adaptive counts of previously coded match lengths are used to code the length information as well.
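
A sketch of such an adaptive length model (the general technique, with an assumed minimum match length; not LZA's exact tables): counts of previously coded lengths drive the arithmetic coder, so common short lengths automatically become cheaper.

    #define MIN_MATCH 2    /* assumed minimum, for illustration */
    #define MAX_MATCH 65

    static unsigned int len_count[MAX_MATCH + 1];

    void init_length_model(void)
    {
        for (int l = MIN_MATCH; l <= MAX_MATCH; l++)
            len_count[l] = 1;     /* never assign zero probability */
    }

    void update_length_model(int len)
    {
        len_count[len]++;         /* seen lengths become cheaper */
    }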

The first code in the token signals whether a single byte (literal) is being sent, or whether the FIFO or main dictionary will be used. For both dictionaries, the second code signals the length, and the third selects the string (or group of strings).
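
This token layout can be sketched as follows (the type and helper names are mine, and the helper coders are hypothetical stand-ins for the arithmetic-coded first/second/third steps described above).

    typedef enum { TOK_LITERAL, TOK_FIFO, TOK_MAIN } TokenKind;

    typedef struct {
        TokenKind     kind;      /* first code: literal / FIFO / main  */
        unsigned char literal;   /* TOK_LITERAL only                   */
        int           length;    /* second code: match length (<= 65)  */
        unsigned int  group_lo;  /* third code: index range covering   */
        unsigned int  group_hi;  /*   the whole group of equal strings */
    } Token;

    void encode_kind(TokenKind k);                     /* hypothetical */
    void encode_literal(unsigned char c);              /* hypothetical */
    void encode_length(TokenKind k, int len);          /* hypothetical */
    void encode_index_range(unsigned lo, unsigned hi); /* hypothetical */

    void encode_token(const Token *t)
    {
        encode_kind(t->kind);                         /* first code  */
        if (t->kind == TOK_LITERAL) {
            encode_literal(t->literal);
        } else {
            encode_length(t->kind, t->length);        /* second code */
            encode_index_range(t->group_lo, t->group_hi); /* third   */
        }
    }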

Internally, LZA uses a "layered" binary search tree. This consists of a series of binary search trees linked together, where each tree searches over a single byte (no string compares are needed). This requires a variable number of node structures, with the worst case being twice the number of dictionary entries.
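
One plausible node layout for such a layered tree is sketched below (an assumption; LZA's actual structures may differ). Each level is a binary search tree keyed on a single byte; matching a byte follows the 'next' pointer down to the tree for the following string position.

    typedef struct LayerNode {
        unsigned char      byte;   /* key for this level              */
        struct LayerNode  *left;   /* same level, smaller byte        */
        struct LayerNode  *right;  /* same level, larger byte         */
        struct LayerNode  *next;   /* tree for the next byte position */
        unsigned int       count;  /* strings passing through here,
                                      i.e. the size of the group      */
    } LayerNode;

    /* Descend one byte: binary search within the level, then step down. */
    LayerNode *descend(LayerNode *level, unsigned char c)
    {
        while (level) {
            if (c < level->byte)      level = level->left;
            else if (c > level->byte) level = level->right;
            else                      return level->next;
        }
        return NULL;   /* byte not present at this level */
    }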

Rather than using straight adaptive coding for the characters, a one-byte header signals the number of different characters used. A special "new character" token determines which of the remaining (out of 256) possible characters to use. The overhead for unused characters depends on how many are left, so that there is NO overhead after the last character is added. This results in significant savings on text files (which typically use about 80 of the 256 possible characters), while adding only 1 byte of overhead on binary files.
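
A sketch of this shrinking alphabet model (the exact coding details are assumptions): unseen bytes are coded as an index into the list of characters not yet used, so each new character costs about log2(remaining) bits; once the count from the header is exhausted, the "new character" token drops out of the model entirely and costs nothing.

    typedef struct {
        unsigned char unused[256];  /* bytes that have not appeared yet */
        int           remaining;    /* how many of the 256 are unseen   */
        int           to_add;       /* header count minus chars seen    */
    } NewCharModel;

    /* Return the index to code for new byte c, removing it from the list. */
    int new_char_index(NewCharModel *m, unsigned char c)
    {
        for (int i = 0; i < m->remaining; i++) {
            if (m->unused[i] == c) {
                m->unused[i] = m->unused[--m->remaining]; /* swap-remove */
                m->to_add--;   /* at zero, drop the new-character token */
                return i;      /* costs about log2(remaining) bits      */
            }
        }
        return -1;  /* c is already part of the alphabet */
    }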