
Unix Power Tools

16.4. Inside spell

[If you have ispell (Section 16.2), there's not a whole lot of reason for using spell any more. Not only is ispell more powerful, it's a heck of a lot easier to update its spelling dictionaries. Nonetheless, we decided to include this article, because it clarifies the kinds of rules that spellcheckers go through to expand on the words in their dictionaries. -- TOR]

On many Unix systems, the directory /usr/lib/spell contains the main program invoked by the spell command along with auxiliary programs and data files.

On some systems, the spell command is a shell script that pipes its input through deroff -w and sort -u (Section 22.6) to remove formatting codes and prepare a sorted word list, one word per line. On other systems, it is a standalone program that does these steps internally. Two separate spelling lists are maintained, one for American usage and one for British usage (invoked with the -b option to spell). These lists, hlista and hlistb, cannot be read or updated directly. They are compressed files, compiled from a list of words represented as nine-digit hash codes. (Hash coding is a technique for looking up information quickly: each word is reduced to a number, and the search is done on the numbers.)
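To get a feel for what the word-list stage hands to the rest of spell, here is a minimal sketch. It is not the spell script itself: deroff isn't installed everywhere, so tr stands in for deroff -w (and, unlike deroff, it knows nothing about formatting codes). The function name wordlist and the sample sentence are made up for this example.

```shell
# wordlist: read text on stdin, write a sorted list of unique words.
# tr is only a rough stand-in for "deroff -w": it turns every run of
# non-letters into a newline. LC_ALL=C keeps the sort order predictable.
wordlist() {
    LC_ALL=C tr -cs 'A-Za-z' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}

printf 'The printerr sent the the output\nto the LaserWriter.\n' | wordlist
```

Each word comes out once, one per line, which is the form the lookup stage expects.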

The main program invoked by spell is spellprog. It loads the list of hash codes from either hlista or hlistb into a table, and it looks for the hash code corresponding to each word on the sorted word list. This eliminates all words (or hash codes) actually found in the spelling list. For the remaining words, spellprog tries to derive a recognizable word by performing various operations on the word stem based on suffix and prefix rules. A few of these manipulations follow:

-y+iness +ness -y+i+less +less -y+ies -t+ce -t+cy

The new words created as a result of these manipulations will be checked once more against the spell table. However, before the stem-derivative rules are applied, the remaining words are checked against a table of hash codes built from the file hstop. The stop list contains typical misspellings that stem-derivative operations might allow to pass. For instance, the misspelled word thier would be converted into thy using the suffix rule -y+ier. The hstop file accounts for as many cases of this type of error as possible.
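The rule notation is easier to follow with a small experiment. The sketch below assumes that a rule written -y+iness means "the word is the stem with y removed and iness added," so undoing it strips the iness and puts the y back. That reading matches the thier example above. The function name stem_of is hypothetical; it is not part of spell.

```shell
# stem_of WORD MINUS PLUS: undo a rule written "-MINUS+PLUS".
# Prints the candidate stem, or nothing if WORD does not end in PLUS.
stem_of() {
    word=$1 minus=$2 plus=$3
    case $word in
        ?*"$plus") printf '%s%s\n' "${word%"$plus"}" "$minus" ;;
    esac
}

stem_of happiness y iness    # -y+iness gives: happy
stem_of kindness '' ness     # +ness    gives: kind
stem_of thier y ier          # -y+ier   gives: thy
```

The last call shows why a stop list is needed: thy is a perfectly good word, so the rule alone would let thier slip through.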

The final output consists of words not found in the spell list -- even after the program tried to search for their stems -- and words that were found in the stop list.
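The order of those checks can be sketched with plain-text files. This is only an illustration under stated assumptions: the files words and stop are hypothetical stand-ins for the hashed hlista and hstop tables, only three rules are tried, and the names in_list and check are invented here.

```shell
# Plain-text stand-ins for the hashed tables (hypothetical contents).
dir=$(mktemp -d)
printf '%s\n' happy kind thy use > "$dir/words"   # stand-in for hlista
printf '%s\n' thier > "$dir/stop"                 # stand-in for hstop

# in_list WORD FILE: succeed if WORD is a whole line of FILE.
in_list() { grep -qxF -- "$1" "$2"; }

# check WORD: print WORD only if it would be reported as misspelled.
check() {
    w=$1
    # 1. Exact hit in the spelling list: accepted, print nothing.
    if in_list "$w" "$dir/words"; then return 0; fi
    # 2. Stop list comes before the rules: report the word.
    if in_list "$w" "$dir/stop"; then printf '%s\n' "$w"; return 0; fi
    # 3. Try a few rules, written here as MINUS:PLUS.
    for rule in y:iness y:ier :ness; do
        minus=${rule%%:*}
        plus=${rule#*:}
        case $w in
            ?*"$plus")
                if in_list "${w%"$plus"}$minus" "$dir/words"; then
                    return 0
                fi ;;
        esac
    done
    # 4. Nothing matched: report the word.
    printf '%s\n' "$w"
}

for w in happy happiness kindness thier printerr; do check "$w"; done
```

Only thier and printerr are printed. If you empty the stop file, thier is accepted through the -y+ier rule, which is exactly the kind of error the stop list is there to catch.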

You can get a better sense of these rules in action by using the -v or -x option. The -v option eliminates the last look-up in the table and produces a list of words that are not actually in the spelling list, along with possible derivatives. It allows you to see which words were found as a result of stem-derivative operations and prints the rule used. (Refer to the sample file in Section 16.1.)

% spell -v sample
Alcuin
ditroff
LaserWriter
PostScript
printerr
TranScript
+out  output
+s    uses
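If you save -v output, you may want to separate the words that were accepted through a rule from the ones that were not recognized at all. The sketch below works on a copy of the output shown above, so it does not need spell to be installed. It assumes that rule-derived lines always start with + or -, as they do in this sample; the function names derived and unknown are made up.

```shell
# A saved copy of the -v output shown above.
v_output='Alcuin
ditroff
LaserWriter
PostScript
printerr
TranScript
+out  output
+s    uses'

# Lines starting with + or - carry a rule followed by the word.
derived() { awk '/^[+-]/ { print $2 " (rule " $1 ")" }'; }
# Everything else is a word that was not found at all.
unknown() { grep -v '^[+-]'; }

printf '%s\n' "$v_output" | derived
printf '%s\n' "$v_output" | unknown
```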

The -x option makes spell begin at the stem-derivative stage and prints the various attempts it makes to find the stem of each word.

% spell -x sample
...
=into
=LaserWriter
=LaserWrite
=LaserWrit
=laserWriter
=laserWrite
=laserWrit
=output
=put
...
LaserWriter
...

Each stem that spell tries is preceded by an equals sign (=). At the end of the output are the words whose stems do not appear in the spell list.

One other file you should know about is spellhist. On some systems, each time you run spell, the output is appended through tee (Section 43.8) into spellhist, in effect creating a list of all the misspelled or unrecognized words for your site. The spellhist file is something of a "garbage" file that keeps on growing: you will want to reduce it or remove it periodically. To extract useful information from spellhist, you might use the sort and uniq -c (Section 21.20) commands to compile a list of the misspelled words or special terms that occur most frequently. It is possible to add these words back into the basic spelling dictionary, but this is too complex a process to describe here. It's probably easier just to use a local spelling dictionary (Section 16.1). Even better, use ispell; not only is it a more powerful spelling program, it is much easier to update the word lists it uses (Section 16.5).
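Here is one way the sort and uniq -c step might look. The sketch builds a small sample file rather than reading a real spellhist, because the file's location varies from system to system; the sample words and the function name top_words are assumptions made for the example.

```shell
# A hypothetical sample; point top_words at your own spellhist instead.
hist=$(mktemp)
printf '%s\n' ditroff printerr ditroff LaserWriter ditroff printerr > "$hist"

# top_words FILE: count each word and list the most frequent first.
# The awk step only tidies the leading spaces that uniq -c adds.
top_words() {
    LC_ALL=C sort "$1" | uniq -c | LC_ALL=C sort -rn | awk '{ print $1, $2 }'
}

top_words "$hist"
```

Words near the top of the list are good candidates for a local spelling dictionary.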

-- DD




Copyright © 2003 O'Reilly & Associates. All rights reserved.