The Quest for a Perfect Dictionary Format

The Perfect Hash

by John Kormylo
E.L.F. Software

I was never entirely happy with the minimum RAM dictionary format. I would have preferred one made entirely of hash tables, but the results were rather large compared to the source text. However, I eventually hit on a viable solution.

The problem with a pure hash solution is that the hash table takes up to 12 bytes/node (using longs), and almost every character at the end of a word requires a new node.

The key to the solution is a "suffix" hash. If you remove from the start of every word all the characters it shares with any other word (plus one more), you are left with what can be described as "suffix" strings. For example, "quackery" and "quacking" give you "ry" and "ng". If you reverse the strings ("yr" and "gn") and hash them, you have a "suffix" hash. Applying this to the 1,628,241-byte source file used previously produced 213,573 bytes of suffix text (not counting NULL terminators), which was hashed using 17,078 nodes.

The second clever part was implementing the suffix hash using only 2 bytes/node. One byte holds the character and the other holds an unsigned relative parent index (0 means end of word). This means that every child node must be within 255 array locations of its parent. As it happens, this works most of the time, but occasionally some nodes have to appear more than once to handle all of their children.

Nodes without parents can go anywhere, so you always start with one of them. Then always place the child node (of a parent within 255 array locations) with the smallest number of descendants (children, grandchildren, etc.). That way, if you have to restart, you will have eliminated as many nodes as possible beforehand. In the above example, the hash table was implemented using an array of 17,121 nodes, restarting 34 times with a total of 43 duplicate nodes (0.25% waste). More to the point, it represented 213,573 bytes of text using only 34,242 bytes of hash, or 84% compression.

Hashing the rest of the text was done using two different types of hash tables, one using longs and the other using shorts. The characters were mapped to values between 1 and N (no optimization, just bypassing some punctuation).

Each node of the "long hash" is implemented as an array of longs. The first entry holds the count, and the remaining N entries hold the index of the next node for each of the mapped characters (1...N). For a table consisting of M nodes, values above M indicate an index (plus M) into the "short hash" array. The long hash table is used until the count drops below 32K (and the number of nodes needed is less than 64K). As it turned out, only one long node was needed. Fortunately, there is no real cost for using the more general formulation.

The "short hash" is a packed hash using unsigned shorts for the parent and base, and a signed short for the count. These indexes are relative to the starting entry (whose absolute index comes from the "long hash"). Each hash table is limited to 64K nodes, but the total number of nodes can (and will) be much larger. The sign bit of the count is used to indicate that this is a valid word termination (terminating NULLs are not hashed). A count of 1 (or 0x8002) indicates that the base value is an index into the suffix hash.

The example dictionary used 52 short hash tables, for a total of 229,656 nodes (4416 nodes/table average). Every hash table produced a perfect hash (no wasted nodes).
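To make the node layouts concrete, here is a minimal C sketch of the two structures described above, plus a routine that verifies a word's tail against the suffix hash. The struct and function names are mine, and the sketch assumes two details the description leaves open: that a child's one-byte link points back to a parent at a lower array index, and that the index handed over by the short hash refers to the node holding the first character of the suffix.

typedef struct {
    unsigned char ch;      /* character stored at this node                   */
    unsigned char rel;     /* offset back to the parent node, 0 = end of word */
} SuffixNode;              /* 2 bytes/node, as described above                */

typedef struct {
    unsigned short parent; /* relative to the table's starting entry          */
    unsigned short base;
    short count;           /* sign bit set = valid word termination;
                              a count of 1 means 'base' indexes the suffix hash */
} ShortNode;               /* one entry of the packed "short hash"            */

/* Check whether 'tail' (the unhashed remainder of a word) matches the
 * suffix rooted at 'index' in the suffix hash.  Because the suffixes were
 * reversed before hashing, walking the parent chain yields the characters
 * in normal reading order.  Returns 1 on a match, 0 otherwise. */
static int suffix_match(const SuffixNode *table, unsigned index,
                        const char *tail)
{
    for (;;) {
        if (*tail == '\0' || table[index].ch != (unsigned char)*tail)
            return 0;              /* word too short, or wrong character      */
        tail++;
        if (table[index].rel == 0) /* reached the end of the stored suffix    */
            return *tail == '\0';  /* the word must end here as well          */
        index -= table[index].rel; /* assumed: parent sits at a lower index   */
    }
}

The two-byte layout is what produces the 84% compression figure quoted above, and the verification walk still touches only one node per remaining character, so lookup speed is not sacrificed.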
The final dictionary used a total of 1,412,480 bytes. While this is much larger than the minimum RAM solution, it is still smaller than the original source text. Nor does it put much of a dent in the RAM of a modern computer. There is no faster method for looking up words than a perfect hash. (Computing the indexes is a little slower.) The only important size limitation on this format is that the suffix hash must use fewer than 64K nodes.