The Quest for a Perfect Dictionary Format
The Minimum RAM Solution

by John Kormylo
E.L.F. Software

For a spell checker to achieve any sort of reasonable speed, it must be RAM resident. (In computer years, disk I/O takes forever.) When I began this project, I had only 4M bytes of RAM installed on my computer, so obviously a RAM-resident dictionary would have to be efficient. Also, to implement a browse function with a slide bar, one must be able to assign an index to each word and locate the word by its index. Given an index, one can then store grammar, thesaurus, or language translation information in arrays.

There is much to be said for a simple array of string pointers. One can use a binary search to locate words, and the word index is simply the array index of the string pointer. The penalty for this approach is only 4 bytes/word (3 if you eliminate the NULL string terminators).

You can improve upon this by using Huffman data compression on the text block. Of course, the string pointers will now have to work at the bit level rather than the byte level.

Further improvement can be achieved by using a hash table for the first few characters. Since the starting characters are used many times, removing them (even compressed) yields a substantial savings. However, as the hash table gets larger the savings drop rapidly, so eventually it pays to stop hashing and fall back to a binary search over the remaining, smaller segment of text.

If the hash table continues until every text segment is less than 8K bytes in length (64K bits), one can use shorts for the bit pointers (relative to the start of the segment). However, for each segment you will need a long bit pointer for the start of the text, and a long pointer (or index) for the start of the short bit pointers.

The hash table was implemented using the "packed hash" method I had developed. However, because this hash table was terminated after only a few characters, there were not enough (or any) "only children" to fill in the gaps. Consequently, the hash was not perfect.

Nonetheless, the results were impressive. My source text was a 1,628,241 byte file containing 161,127 words, including proper names with upper-case letters. This produced a dictionary only 833,451 bytes in size. It used 3235 compressed text segments (binary search) and a 4154-node hash table (8 bytes/node). A perfect hash would have taken only 3752 nodes, for 10.7% wasted space in the hashing. Note that the hash nodes are 8 bytes because each includes a long count, used to compute starting (and ending) index values for each text segment.
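
For illustration, here is a minimal C sketch of the simple string-pointer scheme described above. The names (dict_lookup, words, nwords) are mine, not from the article; the point is only that a binary search over a sorted pointer array returns an index that can double as the word's index into any parallel arrays of grammar or thesaurus data.

#include <string.h>

/* Binary search over a sorted array of string pointers.
 * Returns the word's index, or -1 if the word is not present.
 * (Illustrative sketch; names are not from the article.) */
static long dict_lookup(const char *word, const char **words, long nwords)
{
    long lo = 0, hi = nwords - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int cmp = strcmp(word, words[mid]);

        if (cmp == 0)
            return mid;          /* found: array index == word index */
        else if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;                   /* not in dictionary */
}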
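
The segment and hash-node layout described in the text might look something like the following sketch. The field names and exact arrangement are assumptions on my part; the article specifies only that each 8-byte hash node carries a long count (for computing index values) and that each text segment needs a long bit pointer for the start of its compressed text plus a table of short bit offsets.

/* Sketch of the data layout described above (field names are mine). */
struct hash_node {
    long link;      /* packed-hash linkage information (assumed)         */
    long count;     /* cumulative word count: starting index of the      */
                    /* text segment this node leads to                   */
};

/* Each compressed text segment is under 8K bytes (64K bits), so word
 * positions within it fit in unsigned shorts, measured in bits from the
 * start of the segment. */
struct text_segment {
    long bit_start;                /* bit offset of the segment within the
                                      Huffman-compressed text block       */
    const unsigned short *offsets; /* per-word bit offsets, relative to
                                      bit_start, for binary searching     */
    long nwords;                   /* number of words in this segment     */
};

In this sketch, a lookup would hash the first few characters to reach a node, take its count as the base index, then binary-search the segment's bit offsets (Huffman-decoding each candidate) to find the word's position within the segment; base plus position gives the word index.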