Indexing SGML Documents.
NOTES

The examples given here use the OpenText software (Livelink). However, the principles hold for other methods of indexing.

TOKENS

Livelink uses the concept of a 'token' to define what we would normally call a word. Divisions between tokens can be set to include characters other than spaces. This is done in a configuration file called the 'tokenizer'. The tokenizer file also allows some tokens to be left out of the index, which is useful for eliminating stop words (such as "the") or obscenities.

REGIONS

Searching a large index is much more effective with region searching, which works much like a field search in a MARC record. It is sometimes important, for example, to search for Beaverbrook as an author rather than as a subject.

Regions can be defined in two ways. The "normal" way is to define regions in the tokenizer file. For SGML files, a second method can be used: Livelink ships with a program that builds regions based on the DTD. Using this program adds some flexibility in searching (a feature not yet used here at UNB).

Note: The OpenText product uses a parser that is not up to date. This parser will fail if the SGML files contain empty regions, as most of our files do. This is why the regions were entered in the tokenizer file (see below).

A sample tokenizer file: this one was created for an index to the Special Collections material we worked on in the course. Region entries have been added for the tags in the TEI Lite DTD.

SEARCHING A CREATED INDEX

Searching is possible both from a UNIX command line (yuch!) and through a WWW interface. The Web interface is controlled by a file supplied with the Livelink system and modified for each new database. Configuration files allow the database administrator to control the appearance of the results.
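
The token and region ideas described above can be illustrated with a small Python sketch. This is not Livelink code and does not reproduce the tokenizer file syntax. The separator list, stop words, region tag names, sample documents, and all function names below are assumptions made up for the illustration. The sketch only shows the principle: text is split into tokens on configurable separators, stop words are dropped, and each token is recorded under every region it appears in, so that a search can be limited to one region.

```python
import re

# Hypothetical settings standing in for a tokenizer configuration.
SEPARATORS = " \t\n.,;:!?()\"-"           # characters that divide tokens
STOP_WORDS = {"the", "a", "an", "of", "and"}
REGION_TAGS = {"author", "title", "subject"}  # tags treated as regions

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)[^>]*>")


def tokenize(text):
    """Split text on SEPARATORS, lowercase it, and drop stop words."""
    pattern = "[" + re.escape(SEPARATORS) + "]+"
    words = (w.lower() for w in re.split(pattern, text))
    return [w for w in words if w and w not in STOP_WORDS]


def add_text(index, doc_id, text, open_regions):
    """Record each token under '*' (whole document) and every open region."""
    for token in tokenize(text):
        for region in ["*"] + open_regions:
            index.setdefault(token, {}).setdefault(region, set()).add(doc_id)


def index_document(doc_id, sgml, index):
    """Walk the markup, tracking which region tags are currently open."""
    open_regions = []
    pos = 0
    for match in TAG_RE.finditer(sgml):
        add_text(index, doc_id, sgml[pos:match.start()], open_regions)
        closing, name = match.group(1), match.group(2).lower()
        if name in REGION_TAGS:
            if closing:
                if name in open_regions:
                    open_regions.remove(name)
            else:
                open_regions.append(name)
        pos = match.end()
    add_text(index, doc_id, sgml[pos:], open_regions)


def search(index, word, region="*"):
    """Return the sorted ids of documents containing word in the region."""
    return sorted(index.get(word.lower(), {}).get(region, set()))


# Made-up sample records; doc1 contains an empty <subject> region.
docs = {
    "doc1": "<title>Letters</title><author>Lord Beaverbrook</author>"
            "<subject></subject>",
    "doc2": "<title>The Press Barons</title><author>A. Smith</author>"
            "<subject>Beaverbrook</subject>",
}

index = {}
for doc_id, sgml in docs.items():
    index_document(doc_id, sgml, index)

print(search(index, "Beaverbrook"))             # ['doc1', 'doc2']
print(search(index, "Beaverbrook", "author"))   # ['doc1']
print(search(index, "the"))                     # []
```

The unrestricted search finds both records, while the search limited to the author region finds only the record where Beaverbrook is the author. An empty region simply contributes no tokens in this sketch, so it does not cause a failure.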