The Quest for a Perfect Dictionary Format
The Minimum RAM Solution

by John Kormylo
E.L.F. Software

For a spell checker to achieve any sort of reasonable speed, it must be RAM resident. (In computer years, disk I/O takes forever.) When I began this project, I had only 4M bytes of RAM installed on my computer, so obviously a RAM-resident dictionary would have to be efficient. Also, to implement a browse function with a slide bar, one must be able to assign an index to each word and locate the word by its index. Given an index, one can then store grammar, thesaurus, or language translation information in arrays.

There is much to be said for a simple array of string pointers. One can use a binary search to locate words, and the word index is simply the array index of the string pointer. The penalty for this approach is only 4 bytes/word (3 if you eliminate the NULL string terminators).

You can improve upon this by using Huffman data compression on the text block. Of course, the string pointers will now have to work at the bit level rather than the byte level.

Further improvement can be achieved by using a hash table for the first few characters. Since the starting characters are used many times, removing them (even compressed) yields a substantial savings. However, as the hash table gets larger the savings drop rapidly, so eventually it pays to stop hashing and fall back to a binary search over the remaining, smaller segment of text.

If the hash table continues until every text segment is less than 8K bytes in length (64K bits), one can use shorts for the bit pointers (relative to the start of the segment). However, for each segment you will need a long bit pointer for the start of the text, and a long pointer (or index) for the start of the short bit pointers.

The hash table was implemented using the "packed hash" method I had developed. However, because this hash table was terminated after only a few characters, there were not enough (or any) "only children" to fill in the gaps. Consequently, the hash was not perfect.

Nonetheless, the results were impressive. My source text was a 1,628,241 byte file containing 161,127 words, including proper names with upper-case letters. This produced a dictionary only 833,451 bytes in size. It used 3235 compressed text segments (binary search) and a 4154-node hash table (8 bytes/node). A perfect hash would have taken only 3752 nodes, for 10.7% wasted space in the hashing. Note that the hash nodes are 8 bytes because each includes a long count, used to compute starting (and ending) index values for each text segment.
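
For illustration, here is a minimal C sketch of the simple string-pointer scheme described above. The names (dict_lookup, words, nwords) are mine, not from the article; the point is only that a binary search over a sorted pointer array returns an index that can double as the word's index into any parallel arrays of grammar or thesaurus data.

#include <string.h>

/* Binary search over a sorted array of string pointers.
 * Returns the word's index, or -1 if the word is not present.
 * (Illustrative sketch; names are not from the article.) */
static long dict_lookup(const char *word, const char **words, long nwords)
{
    long lo = 0, hi = nwords - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int cmp = strcmp(word, words[mid]);

        if (cmp == 0)
            return mid;          /* found: array index == word index */
        else if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;                   /* not in dictionary */
}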
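
The segment and hash-node layout described in the text might look something like the following sketch. The field names and exact arrangement are assumptions on my part; the article specifies only that each 8-byte hash node carries a long count (for computing index values) and that each text segment needs a long bit pointer for the start of its compressed text plus a table of short bit offsets.

/* Sketch of the data layout described above (field names are mine). */
struct hash_node {
    long link;      /* packed-hash linkage information (assumed)         */
    long count;     /* cumulative word count: starting index of the      */
                    /* text segment this node leads to                   */
};

/* Each compressed text segment is under 8K bytes (64K bits), so word
 * positions within it fit in unsigned shorts, measured in bits from the
 * start of the segment. */
struct text_segment {
    long bit_start;                /* bit offset of the segment within the
                                      Huffman-compressed text block       */
    const unsigned short *offsets; /* per-word bit offsets, relative to
                                      bit_start, for binary searching     */
    long nwords;                   /* number of words in this segment     */
};

In this sketch, a lookup would hash the first few characters to reach a node, take its count as the base index, then binary-search the segment's bit offsets (Huffman-decoding each candidate) to find the word's position within the segment; base plus position gives the word index.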