ispell(5) - Man pages

dwww Home | Manual pages | Find package
ISPELL(5)                     File Formats Manual                    ISPELL(5)

NAME
       ispell - format of ispell dictionaries and affix files

DESCRIPTION
       Ispell(1)  requires  two files to define the language that it is spell-
       checking.  The first file is a dictionary containing words for the lan-
       guage,  and  the  second is an "affix" file that defines the meaning of
       special flags in the dictionary.  The two files are combined by  build-
       hash  (see ispell(1)) and written to a hash file which is not described
       here.

       A raw ispell dictionary (either the main dictionary or  your  own  per-
       sonal  dictionary)  contains  a list of words, one per line.  Each word
       may optionally be followed by a slash ("/")  and  one  or  more  flags,
       which  modify  the  root word as explained below.  Depending on the op-
       tions with which ispell was built, case may or may not  be  significant
       in  either the root word or the flags, independently.  Specifically, if
       the compile-time option CAPITALIZATION is defined, case is  significant
       in  the  root  word;  if not, case is ignored in the root word.  If the
       compile-time option MASKBITS is set to a value of 32, case  is  ignored
       in the flags; otherwise case is significant in the flags.  Contact your
       system administrator or ispell maintainer for more information (or  use
       the -vv flag to find out).  The dictionary should be sorted with the -f
       flag of sort(1) before the hash file is built; this is  done  automati-
       cally  by  munchlist(1), which is the normal way of building dictionar-
       ies.

       If the dictionary contains words that have string characters  (see  the
       affix-file  documentation  below),  they  must be written in the format
       given by the defstringtype statement in the affix file.  This  will  be
       the  case  for most non-English languages.  Be careful to use this for-
       mat, rather than that of your favorite formatter, when adding words  to
       a  dictionary.  (If you add words to your personal dictionary during an
       ispell session, they will automatically be  converted  to  the  correct
       format.   This  feature  can be used to convert an entire dictionary if
       necessary:)

                   echo qqqqq > dummy.dict
                   buildhash dummy.dict affix-file dummy.hash
                   awk '{print "*"}END{print "#"}' old-dict-file \
                   | ispell -a -T old-dict-string-type \
                     -d ./dummy.hash -p ./new-dict-file \
                     > /dev/null
                   rm dummy.*

       The case of the root word controls the case of words  accepted  by  is-
       pell, as follows:

       (1)    If the root word appears only in lower case (e.g., bob), it will
              be accepted in lower case, capitalized, or all capitals.

       (2)    If the root word appears capitalized (e.g., Robert), it will not
              be  accepted in all-lower case, but will be accepted capitalized
              or all in capitals.

       (3)    If the root word appears all in capitals (e.g., UNIX),  it  will
              only be accepted all in capitals.

       (4)    If  the  root  word appears with a "funny" capitalization (e.g.,
              ITCorp), a word will be accepted only if it follows  that  capi-
              talization, or if it appears all in capitals.

       (5)    More  than  one  capitalization of a root word may appear in the
              dictionary.  Flags from different capitalizations  are  combined
              by OR-ing them together.

       Redundant  capitalizations  (e.g.,  bob  and  Bob)  will be combined by
       buildhash and by ispell (for personal dictionaries), and can be removed
       from a raw dictionary by munchlist.

       For example, the dictionary:

              bob
              Robert
              UNIX
              ITcorp
              ITCorp

       will  accept  bob,  Bob, BOB, Robert, ROBERT, UNIX, ITcorp, ITCorp, and
       ITCORP, and will reject all others.  Some of the unacceptable forms are
       bOb, robert, Unix, and ItCorp.

       As  mentioned  above,  root  words in any dictionary may be extended by
       flags.  Each flag is a single alphabetic character, which represents  a
       prefix or suffix that may be added to the root to form a new word.  For
       example, in an English dictionary the D flag can be added to  bathe  to
       make bathed.  Since flags are represented as a single bit in the hashed
       dictionary, this results in significant space savings.   The  munchlist
       script will reduce an existing raw dictionary by adding flags when pos-
       sible.

       When a word is extended with an affix, the affix will be accepted  only
       if  it  appears in the same case as the initial (prefix) or final (suf-
       fix) letter of the word.  Thus, for example, the entry  UNIX/M  in  the
       main dictionary (M means add an apostrophe and an "s" to make a posses-
       sive) would accept UNIX'S but would reject UNIX's.  If UNIX's is legal,
       it  must appear as a separate dictionary entry, and it will not be com-
       bined by munchlist.  (In general, you don't need to worry  about  these
       things; munchlist guarantees that its output dictionary will accept the
       same set of words as its input, so all you have to do is add  words  to
       the dictionary and occasionally run munchlist to reduce its size).

       As  mentioned,  the affix definition file describes the affixes associ-
       ated with particular flags.  It also describes the character  set  used
       by the language.

       Although  the  affix-definition grammar is designed for a line-oriented
       layout, it is actually a free-format yacc grammar and can be  laid  out
       weirdly if you want.  Comments are started by a pound (sharp) sign (#),
       and continue to the end of the line.  Backslashes are supported in  the
       usual  fashion (\nnn, plus specials \n, \r, \t, \v, \f, \b, and the new
       hex format \xnn).  Any character with special meaning to the parser can
       be  changed  to an uninterpreted token by backslashing it; for example,
       you can declare a flag named 'asterisk' or 'colon'  with  flag  \*:  or
       flag \::.

       The grammar will be presented in a top-down fashion, with discussion of
       each element.  An affix-definition file must contain exactly one table:

              table     :    [headers] [prefixes] [suffixes]

       At least one of prefixes and suffixes is required.  They can appear  in
       either order.

              headers   :    [ options ] char-sets

       The  headers  describe  options global to this dictionary and language.
       These include the character sets to be used and the formatter, and  the
       defaults for certain ispell flags.

              options   :    { fmtr-stmt | opt-stmt | flag-stmt | num-stmt }

       The options statements define the defaults for certain ispell flags and
       for the character sets used by the formatters.

              fmtr-stmt :    { nroff-stmt | tex-stmt }

       A fmtr-stmt describes characters that have special meaning to a format-
       ter.  Normally, this statement is not necessary, but some languages may
       have preempted the usual defaults for use as language-specific  charac-
       ters.   In this case, these statements may be used to redefine the spe-
       cial characters expected by the formatter.

              nroff-stmt :   { nroffchars | troffchars } string

       The nroffchars statement allows redefinition of certain  nroff  control
       characters.  The string given must be exactly five characters long, and
       must list substitutions for the left and right parentheses ("()") , the
       period  ("."), the backslash ("\"), and the asterisk ("*").  (The right
       parenthesis is not currently used, but is included  for  completeness.)
       For example, the statement:

              nroffchars {}.\\*

       would  replace the left and right parentheses with left and right curly
       braces for purposes of parsing nroff/troff strings, with no  effect  on
       the  others  (admittedly a contrived example).  Note that the backslash
       is escaped with a backslash.

              tex-stmt  :    { TeXchars | texchars } string

       The TeXchars statement allows redefinition of certain TeX/LaTeX control
       characters.  The string given must be exactly thirteen characters long,
       and must list substitutions for the left and right parentheses ("()") ,
       the  left  and  right  square brackets ("[]"), the left and right curly
       braces ("{}"), the left and right angle brackets ("<>"), the  backslash
       ("\"),  the  dollar  sign  ("$"), the asterisk ("*"), the period or dot
       ("."), and the percent sign ("%").  For example, the statement:

              texchars ()\[]<\><\>\\$*.%

       would replace the functions of the left and right curly braces with the
       left  and  right  angle brackets for purposes of parsing TeX/LaTeX con-
       structs, while retaining their functions for the tib bibliographic pre-
       processor.   Note  that the backslash, the left square bracket, and the
       right angle bracket must be escaped with a backslash.

              opt-stmt  :    { cmpnd-stmt | aff-stmt }

              cmpnd-stmt :   compoundwords compound-opt

              aff-stmt  :    allaffixes on-or-off

              on-or-off :    { on | off }

              compound-opt : { on-or-off | controlled character }

       An opt-stmt controls certain ispell defaults that are  best  made  lan-
       guage-specific.   The allaffixes statement controls the default for the
       -P and -m options to ispell.  If allaffixes  is  turned  off  (the  de-
       fault),  ispell will default to the behavior of the -P flag: root/affix
       suggestions will only be made if there are no "near misses".  If allaf-
       fixes is turned on, ispell will default to the behavior of the -m flag:
       root/affix suggestions will always be made.  The  compoundwords  state-
       ment controls the default for the -B and -C options to ispell.  If com-
       poundwords is turned off (the default), ispell will default to the  be-
       havior  of  the -B flag: run-together words will be reported as errors.
       If compoundwords is turned on, ispell will default to the  behavior  of
       the -C flag: run-together words will be considered as compounds if both
       are in the dictionary.  This is useful for languages such as German and
       Norwegian,  which  form  large  numbers of compound words.  Finally, if
       compoundwords is set to controlled, only words marked with the flag in-
       dicated  by  character (which should not be otherwise used) will be al-
       lowed to participate in compound formation.  Because  this  option  re-
       quires the flags to be specified in the dictionary, it is not available
       from the command line.

              flag-stmt :    flagmarker character

       The flagmarker statement describes the character which is used to sepa-
       rate  affix  flags  from  the root word in a raw dictionary file.  This
       must be a character which is not found in any word (including in string
       characters;  see  below).  The default is "/" because this character is
       not normally used to represent special characters in any language.

              num-stmt  :    compoundmin digit

       The compoundmin statement controls the length of the two components  of
       a compound word.  This only has an effect if compoundwords is turned on
       or if the -C flag is given to ispell.  In  that  case,  only  words  at
       least  as long as the given minimum will be accepted as components of a
       compound.  The default is 3 characters.

              char-sets :    norm-sets [ alt-sets ]

       The character-set section describes the characters that can be part  of
       a word, and defines their collating order.  There must always be a def-
       inition of "normal" character sets;  in addition, there may be  one  or
       more  partial definitions of "alternate" sets which are used with vari-
       ous text formatters.

              norm-sets :    [ deftype ] [ set-options ] charset-group

       A "normal" character set may optionally begin with a definition of  the
       file  suffixes  that  make  use of this set.  Following this are one or
       more character-set declarations.

              deftype   :         defstringtype name deformatter suffix*

       The defstringtype declaration gives  a  list  of  file  suffixes  which
       should make use of the default string characters defined as part of the
       base character set; it is only necessary if string characters are being
       defined.  The name parameter is a string giving the unique name associ-
       ated with these suffixes; often it is a formatter name.  If the format-
       ter  is  a  member  of the troff family, "nroff" should be used for the
       name associated with the most popular macro package; members of the TeX
       family  should  use  "tex".  Other names may be chosen freely, but they
       should be kept simple, as they are used in ispell 's -T switch to spec-
       ify  a  formatter type.  The deformatter parameter specifies the defor-
       matting style to use when processing files  with  the  given  suffixes.
       Currently, this must be plain, tex, nroff, or html.  The suffix parame-
       ters are a whitespace-separated list of strings which,  if  present  at
       the end of a filename, indicate that the associated set of string char-
       acters should be used by default for this file.  For example, the  suf-
       fix  list  for  the  troff  family  typically includes suffixes such as
       ".ms", ".me", ".mm", etc.

              set-options :  options charset-options*

       The options declaration activates one or more  white-separated  options
       for the given character set (default or alternate).  Currently, two op-
       tions are supported: The raw_display option indicates that string char-
       acters  should  be displayed as-is even if some of their components ap-
       pear to be non-printing; this option is useful for character sets  such
       as  UTF-8 or (if the terminal is configured appropriately) ISO Latin-1.
       The squeeze_string option indicates that  when  ispell  is  interacting
       with an external client such as emacs (via the -a flag), string charac-
       ters should be considered to be of length  1  rather  than  their  true
       length  in bytes.  This option is needed to allow ispell to synchronize
       with emacs when processing files containing UTF-8 characters; it should
       only be given for UTF-8 character sets.

              charset-group :     { char-stmt | string-stmt | dup-stmt}*

       A  char-stmt describes single characters; a string-stmt describes char-
       acters that must appear together as a string, and which usually  repre-
       sent  a  single  character in the target language.  Either may also de-
       scribe conversion between upper and lower case.  A dup-stmt is used  to
       describe alternate forms of string characters, so that a single dictio-
       nary may be used with several formatting programs  that  use  different
       conventions for representing non-ASCII characters.

              char-stmt :    wordchars character-range
                        |    wordchars lowercase-range uppercase-range
                        |    boundarychars character-range
                        |    boundarychars lowercase-range uppercase-range
              string-stmt :  stringchar string
                        |    stringchar lowercase-string uppercase-string

       Characters  described  with  the boundarychars statement are considered
       part of a word only if they appear singly, embedded between  characters
       declared  with the wordchars or stringchar statements.  For example, if
       the hyphen is a boundary character (useful in French), the string "foo-
       bar" would be a single word, but "-foo" would be the same as "foo", and
       "foo--bar" would be two words separated by non-word characters.

       If two ranges or strings are given in a char-stmt or  string-stmt,  the
       first  describes  characters  that are interpreted as lowercase and the
       second describes uppercase.  In the case of a stringchar statement, the
       two  strings  must be of the same length.  Also, in a stringchar state-
       ment, the actual strings may  contain  both  uppercase  and  characters
       themselves without difficulty; for instance, the statement

              stringchar     "\\*(sS"  "\\*(Ss"

       is  legal  and will not interfere with (or be interfered with by) other
       declarations of of "s" and "S" as lower and upper case, respectively.

       A final note on string characters: some languages collate certain  spe-
       cial  characters  as if they were strings.  For example, the German "a-
       umlaut" is traditionally sorted as if it were "ae".  Ispell is not  ca-
       pable  of this; each character must be treated as an individual entity.
       So in certain cases, ispell will sort a list of words into a  different
       order than the standard "dictionary" order for the target language.

              alt-sets  :    alttype [ set-options ] [ alt-stmt* ]

       Because  different formatters use different notations to represent non-
       ASCII characters, ispell must be aware of the representations  used  by
       these formatters.  These are declared as alternate sets of string char-
       acters.

              alttype   :    altstringtype name suffix*

       The altstringtype statement introduces each set by declaring the  asso-
       ciated formatter name and filename suffix list.  This name and list are
       interpreted exactly as in the defstringtype statement above.  Following
       this  header  are  one  or  more  alt-stmts which declare the alternate
       string characters used by this formatter.

              alt-stmt  :    altstringchar alt-string std-string

       The altstringchar statement  describes  alternate  representations  for
       string  characters.  For example, the -mm macro package of troff repre-
       sents the German "a-umlaut" as a\*:, while TeX uses the  sequence  \"a.
       If  the  troff  versions  are  declared  as the standard versions using
       stringchar, the TeX versions may be declared as alternates by using the
       statement

              altstringchar  \\\"a     a\\*

       When  the  altstringchar  statement is used to specify alternate forms,
       all forms for a particular formatter must be  declared  together  as  a
       group.   Also,  each formatter or macro package must provide a complete
       set of characters, both upper- and lower-case, and  the  character  se-
       quences used for each formatter must be completely distinct.  Character
       sequences which describe upper- and lower-case  versions  of  the  same
       printable  character must also be the same length.  It may be necessary
       to define some new macros for a given formatter to  satisfy  these  re-
       strictions.   (The  current version of buildhash does not enforce these
       restrictions, but failure to obey them may result in errors  being  in-
       troduced into files that are processed with ispell.)

       An important minor point is that ispell assumes that all characters de-
       clared as wordchars or boundarychars will occupy exactly  one  position
       on the terminal screen.

       A  single character-set statement can declare either a single character
       or a contiguous range of characters.  A range is given as in egrep  and
       the shell: [a-z] means lowercase alphabetics; [^a-z] means all but low-
       ercase, etc.  All character-set statements are  combined  (unioned)  to
       produce  the  final list of characters that may be part of a word.  The
       collating order of the characters is defined by the order of their dec-
       laration;  if  a  range  is used, the characters are considered to have
       been declared in ASCII order.  Characters that have case  are  collated
       next to each other, with the uppercase character first.

       The  character-declaration  statements  have  a rather strange behavior
       caused by its need to match each lowercase character with its uppercase
       equivalent.   In  any  given  wordchars or boundarychars statement, the
       characters in each range are first sorted into a collating sequence  by
       their byte values, then matched one-for-one with the other range.  (The
       two ranges must have the same number of characters).  Thus,  for  exam-
       ple, the two statements:

              wordchars [aeiou] [AEIOU]
              wordchars [aeiou] [UOIEA]

       would  produce  exactly the same effect.  To get the vowels to match up
       "wrong", you would have to use separate statements:

              wordchars a U
              wordchars e O
              wordchars i I
              wordchars o E
              wordchars u A

       which would cause uppercase 'e' to be 'O', and lowercase 'O' to be 'e'.
       This  should  normally  be a problem only with languages that have been
       forced to use a strange collating sequence.  If your uppercase and low-
       ercase  letters  both  collate in the same order, you shouldn't have to
       worry about this "feature".

       The prefixes and suffixes sections have exactly the same syntax, except
       for the introductory keyword.

              prefixes  :    prefixes flagdef*
              suffixes  :    suffixes flagdef*
              flagdef   :    flag [*|~] char : repl*

       A prefix or suffix table consists of an introductory keyword and a list
       of flag definitions.  Flags can be defined more  than  once,  in  which
       case  the definitions are combined.  Each flag controls one or more re-
       pls (replacements) which are conditionally applied to the beginnings or
       endings of various words.

       Flags  are named by a single character char.  Depending on a configura-
       tion option, this character can be either any uppercase letter (the de-
       fault  configuration)  or  any  7-bit  ASCII character.  Most languages
       should be able to get along with just 26 flags.

       A flag character may be prefixed with one or  more  option  characters.
       (If  you  wish to use one of the option characters as a flag character,
       simply enclose it in double quotes.)

       The asterisk (*) option means that this  flag  participates  in  cross-
       product  formation.  This only matters if the file contains both prefix
       and suffix tables.  If so, all prefixes and suffixes marked with an as-
       terisk will be applied in all cross-combinations to the root word.  For
       example, consider the root fix with prefixes pre and in,  and  suffixes
       es  and  ed.   If all flags controlling these prefixes and suffixes are
       marked with an asterisk, then the single root fix would  also  generate
       prefix,  prefixes,  prefixed,  infix, infixes, infixed, fix, fixes, and
       fixed.  Cross-product formation can produce a  large  number  of  words
       quickly, some of which may be illegal, so watch out.  If cross-products
       produce illegal words, munchlist will not produce those  flag  combina-
       tions, and the flag will not be useful.

              repl :    condition* > [ - strip-string , ] append-string

       The  ~  option specifies that the associated flag is only active when a
       compound word is being formed.  This is useful in a language like  Ger-
       man, where the form of a word sometimes changes inside a compound.

       A repl is a conditional rule for modifying a root word.  Up to 8 condi-
       tions may be specified.  If the conditions are satisfied, the rules  on
       the right-hand side of the repl are applied, as follows:

       (1)    If a strip-string is given, it is first stripped from the begin-
              ning or ending (as appropriate) of the root word.

       (2)    Then the append-string is added at that point.

       For example, the condition .  means "any word",  and  the  condition  Y
       means "any word ending in Y".  The following (suffix) replacements:

              .    >    MENT
              Y    >    -Y,IES

       would change induce to inducement and fly to flies.  (If they were con-
       trolled by the same flag, they would also change fly to flyment,  which
       might not be what was wanted.  Munchlist can be used to protect against
       this sort of problem; see the command sequence given below.)

       No matter how much you might wish it, the strings on the right must  be
       strings  of  specific  characters,  not ranges.  The reasons are rooted
       deeply in the way ispell works, and it would be difficult or impossible
       to provide for more flexibility.  For example, you might wish to write:

              [EY] >    -[EY],IES

       This will not work.  Instead, you must use two separate rules:

              E    >    -E,IES
              Y    >    -Y,IES

       The application of repls can be restricted to certain words with condi-
       tions:

              condition :    { . | character | range }

       A condition is a restriction on the characters that adjoin, and/or  are
       replaced  by,  the right-hand side of the repl.  Up to 8 conditions may
       be given, which should be enough context for  anyone.   The  right-hand
       side  will be applied only if the conditions in the repl are satisfied.
       The conditions also implicitly define a length; roots shorter than  the
       number  of  conditions  will  not pass the test.  (As a special case, a
       condition of a single dot "." defines a length of  zero,  so  that  the
       rule  applies  to all words indiscriminately).  This length is indepen-
       dent of the separate test that insists that all flags produce an output
       word length of at least four.

       Conditions  that  are  single  characters  should be separated by white
       space.  For example, to specify words ending in "ED", write:

              E D  >    -ED,ING        # As in covered > covering

       If you write:

              ED   >    -ED,ING

       the effect will be the same as:

              [ED] >    -ED,ING

       As a final minor, but important point, it is sometimes  useful  to  re-
       build  a  dictionary file using an incompatible suffix file.  For exam-
       ple, suppose you expanded the "R" flag to generate "er" and "ers" (thus
       making  the  Z flag somewhat obsolete).  To build a new dictionary new-
       dict that, using newaffixes, will accept exactly the same list of words
       as  the  old list olddict did using oldaffixes, the -c switch of munch-
       list is useful, as in the following example:

              $ munchlist -c oldaffixes -l newaffixes olddict > newdict

       If you use this procedure, your new dictionary will always  accept  the
       same  list  the  original  did,  even if you badly screwed up the affix
       file.  This is because munchlist compares the words generated by a flag
       with the original word list, and refuses to use any flags that generate
       illegal words.

EXAMPLES
       As an example of conditional suffixes, here is the specification of the
       S flag from the English affix file:

              flag *S:
                  [^AEIOU]Y  >    -Y,IES    # As in imply > implies
                  [AEIOU]Y   >    S         # As in convey > conveys
                  [SXZH]     >    ES        # As in fix > fixes
                  [^SXZHY]   >    S         # As in bat > bats

       The  first  line applies to words ending in Y, but not in vowel-Y.  The
       second takes care of the vowel-Y words.  The third then  handles  those
       words  that  end  in a sibilant or near-sibilant, and the last picks up
       everything else.

       Note that the conditions are written very carefully so that they  apply
       to  disjoint  sets  of words.  In particular, note that the fourth line
       excludes words ending in Y as well as the obvious SXZH.  Otherwise,  it
       would convert "imply" into "implys".

       Although  the  English  affix  file does not do so, you can also have a
       flag generate more than one variation on a root word.  For example,  we
       could extend the English "R" flag as follows:

              flag *R:
                 E           >    R         # As in skate > skater
                 E           >    RS        # As in skate > skaters
                 [^AEIOU]Y   >    -Y,IER    # As in multiply > multiplier
                 [^AEIOU]Y   >    -Y,IERS   # As in multiply > multipliers
                 [AEIOU]Y    >    ER        # As in convey > conveyer
                 [AEIOU]Y    >    ERS       # As in convey > conveyers
                 [^EY]       >    ER        # As in build > builder
                 [^EY]       >    ERS       # As in build > builders

       This  flag  would  generate  both  "skater" and "skaters" from "skate".
       This capability can be very useful in languages that make use of  noun,
       verb,  and  adjective endings.  For instance, one could define a single
       flag that generated all of the German "weak" verb endings.

SEE ALSO
       ispell(1)

                                     local                           ISPELL(5)
Generated by dwww version 1.15 on Sun Jun 23 21:28:12 CEST 2024.