The Quest for a Perfect Dictionary Format

The Perfect Hash

by John Kormylo
E.L.F. Software

I was never entirely happy with the minimum RAM dictionary format. I would have preferred one made entirely of hash tables, but the results were rather large compared to the source text. However, I eventually hit on a viable solution.

The problem with a pure hash solution is that the hash table takes up to 12 bytes/node (using longs), and almost every character at the end of a word requires a new node.

The key to the solution is a "suffix" hash. If you remove from the start of every word all the characters it shares with any other word (plus one more), you are left with what can be described as "suffix" strings. For example, "quackery" and "quacking" give you "ry" and "ng". If you reverse the strings ("yr" and "gn") and hash them, you have a "suffix" hash. Applying this to the 1,628,241-byte source file used previously produced 213,573 bytes of suffix text (not counting NULL terminators), which was hashed using 17,078 nodes.

The second clever part was implementing the suffix hash using only 2 bytes/node. One byte holds the character and the other holds an unsigned relative parent index (0 means end of word). This means that every child node must be within 255 array locations of its parent. As it happens, this works most of the time, but occasionally some nodes have to appear more than once to handle all of their children.

Nodes without parents can go anywhere, so you always start with one of them. Then always place the child node (of a parent within 255 array locations) with the smallest number of descendants (children, grandchildren, etc.). That way, if you have to restart, you will have eliminated as many nodes as possible beforehand. In the above example, the hash table was implemented using an array of 17,121 nodes, restarting 34 times with a total of 43 duplicate nodes (0.25% waste). More to the point, it represented 213,573 bytes of text using only 34,242 bytes of hash, or 84% compression.

Hashing the rest of the text was done using two different types of hash tables, one using longs and the other using shorts. The characters were mapped to values between 1 and N (no optimization, just bypassing some punctuation).

Each node of the "long hash" is implemented as an array of longs. The first entry holds the count, and the remaining N entries hold the index of the next node for each of the mapped characters (1...N). For a table consisting of M nodes, values above M indicate an index (plus M) into the "short hash" array. The long hash table is used until the count drops below 32K (and the number of nodes needed is less than 64K). As it turned out, only one long node was needed. Fortunately, there is no real cost for using the more general formulation.

The "short hash" is a packed hash using unsigned shorts for the parent and base, and a signed short for the count. These indexes are relative to the starting entry (whose absolute index comes from the "long hash"). Each hash table is limited to 64K nodes, but the total number of nodes can (and will) be much larger. The sign bit of the count is used to indicate that this is a valid word termination (terminating NULLs are not hashed). A count of 1 (or 0x8002) indicates that the base value is an index into the suffix hash.

The example dictionary used 52 short hash tables, for a total of 229,656 nodes (4416 nodes/table average). Every hash table produced a perfect hash (no wasted nodes).
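To make the node layouts concrete, here is a minimal C sketch of the two structures described above, plus a routine that verifies a word's tail against the suffix hash. The struct and function names are mine, and the sketch assumes two details the description leaves open: that a child's one-byte link points back to a parent at a lower array index, and that the index handed over by the short hash refers to the node holding the first character of the suffix.

typedef struct {
    unsigned char ch;      /* character stored at this node                   */
    unsigned char rel;     /* offset back to the parent node, 0 = end of word */
} SuffixNode;              /* 2 bytes/node, as described above                */

typedef struct {
    unsigned short parent; /* relative to the table's starting entry          */
    unsigned short base;
    short count;           /* sign bit set = valid word termination;
                              a count of 1 means 'base' indexes the suffix hash */
} ShortNode;               /* one entry of the packed "short hash"            */

/* Check whether 'tail' (the unhashed remainder of a word) matches the
 * suffix rooted at 'index' in the suffix hash.  Because the suffixes were
 * reversed before hashing, walking the parent chain yields the characters
 * in normal reading order.  Returns 1 on a match, 0 otherwise. */
static int suffix_match(const SuffixNode *table, unsigned index,
                        const char *tail)
{
    for (;;) {
        if (*tail == '\0' || table[index].ch != (unsigned char)*tail)
            return 0;              /* word too short, or wrong character      */
        tail++;
        if (table[index].rel == 0) /* reached the end of the stored suffix    */
            return *tail == '\0';  /* the word must end here as well          */
        index -= table[index].rel; /* assumed: parent sits at a lower index   */
    }
}

The two-byte layout is what produces the 84% compression figure quoted above, and the verification walk still touches only one node per remaining character, so lookup speed is not sacrificed.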
The final dictionary used a total of 1,412,480 bytes. While this is much larger than the minimum RAM solution, it is still smaller than the original source text. Nor does it put much of a dent in the RAM of a modern computer. There is no faster method for looking up words than a perfect hash. (Computing the indexes is a little slower.) The only important size limitation on this format is that the suffix hash must use fewer than 64K nodes.