Willpower Information
Information Management Consultants
Thesaurus principles and practice
This paper was originally presented at a workshop "Thesauri for museum documentation" held at
the Science Museum, London, on 24th February 1992. The proceedings
of the workshop have been published by the
mda (formerly the Museum Documentation Association).
- Why do we need a thesaurus?
- A limited list of indexing terms
- Hierarchical relationships
- Related terms
- Definitions and scope notes
- Form of the thesaurus
- Special factors relating to museum objects
- Use of a thesaurus when cataloguing
- Use and modification of existing thesauri
- Thesaurus maintenance
- What sort of fields is a thesaurus appropriate for?
- Other subject retrieval techniques
One of the reasons for documenting our collections is that we wish to be able to find objects of a
particular kind. We may ask "What thermometers do we have in the collection?", "What arrowheads?",
"What frocks?", "What whales?" or "What textile machinery?"
The
simple answer is that we give each item a "name", and then we can create a file of index cards, or
a computer file, in which we can search for these names and expect to find all the appropriate
items. This is the concept of the simple name field in the MDA data structure. It is
straightforward at first, and seems intuitive, but once you have documentation which has been built
up over time, perhaps by many different people, problems creep in unless there are rules and
guidelines to maintain consistency.
The word thesaurus is a rather fancy name, which has acquired a certain mystique,
because it is often bandied about as something necessary for effective information retrieval, but
something which sounds as though it will involve a lot of work. I have often heard curators say
"That's all very well if you have the time and resources, but I have this great backlog of
cataloguing to do, and I would never get through the half of it if I had to spend time setting up
anything as complicated as a thesaurus. What I need is a simple list of names which I can use to
index my objects."
My main purpose in this paper is to make three points:
- A simple name list without some rules will rapidly become a mess.
- Only three simple rules are needed; using them will make life easier for you, not harder.
- So long as you stick to these rules, you can take an existing thesaurus and adapt it to your
needs; you are not limited to using the terms which are listed in it already, and you are not
obliged to use more detail than you need.
What are these rules?
- Use a limited list of indexing terms, but plenty of entry terms
-- link these with USE and USE FOR (UF) relationships.
- Structure terms of the same type into hierarchies
-- link these with BROADER TERM/NARROWER TERM (BT/NT) relationships.
- Remind users of other terms to consider
-- link these with RELATED TERM/RELATED TERM (RT/RT) relationships.
I shall consider each of these rules in turn.
A major purpose of a thesaurus is to match the terms brought to the system by an enquirer with the
terms used by the indexer. Whenever there are alternative names for a type of item, we have to
choose one to use for indexing, and provide an entry under each of the others saying what the
preferred term is. If we index all full-length ladies' garments as dresses, then someone
who searches for frocks must be told that they should look for dresses instead.
This is no problem if the two words are really synonyms, and even if they do differ slightly in
meaning it may still be preferable to choose one and index everything under that. I do not know the
difference between dresses and frocks but I am fairly sure that someone searching
a modern clothing collection who was interested in the one would also want to see what had been
indexed under the other. We normally do this by linking the terms with the relationships USE
and USE FOR, thus:
| Dresses | USE FOR | Frocks |
| Frocks | USE | Dresses |
This may be shown in a printed list, or it may be held in a computer system, which can make the
substitution automatically. If an indexer assigns the term Frocks, the computer will change it to
Dresses, and if someone searches for Frocks the computer will search for Dresses instead, so that
the same items will be retrieved whichever term is used. A friendly computer will explain what it
is doing, so that the user is not puzzled by being given items with terms different from those
asked for.
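The automatic substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `USE` table and the functions `preferred` and `use_for` are hypothetical names, not taken from any particular system, and the single entry comes from the example in this paper.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred term (USE).
USE = {
    "frocks": "dresses",
}


def preferred(term):
    """Return the preferred indexing term and a note explaining any substitution."""
    key = term.strip().lower()
    target = USE.get(key)
    if target is None:
        # The term is not a non-preferred entry, so it is used unchanged.
        return key, None
    # A "friendly" system tells the user what it has done.
    return target, f"{key} USE {target}"


def use_for(term):
    """Derive the reciprocal USE FOR list for a preferred term."""
    return sorted(k for k, v in USE.items() if v == term)
```

The same `preferred` function would be applied both when an indexer assigns a term and when an enquirer searches, so that either route leads to the same set of items.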
USE and USE FOR relationships are thus used between synonyms or
pairs of terms which are so nearly the same that they do not need to be distinguished in the
context of a particular collection. Other examples might be:
| Cloaks | USE | Capes |
| Capes | USE FOR | Cloaks |
| Nuclear energy | USE | Nuclear power |
| Nuclear power | USE FOR | Nuclear energy |
| Baby carriages | USE | Perambulators |
| Perambulators | USE FOR | Baby carriages |
| Perambulators | USE FOR | Prams |
| Prams | USE | Perambulators |
If we name objects, we want to be as specific as possible. If we have
worked hard to discern subtle distinctions in nature, type or style, we certainly want to record
these. The point is that the thesaurus is not the place to do this. Detailed description
of an object is the job of the catalogue record; the job of the thesaurus, and the index which
is built by allocating thesaurus terms to objects, is to provide useful access points by
which that record can be retrieved.
USE and USE FOR relationships can also be used to group similar
items together, because too much specificity is as bad as too little. If we have a small clothing
collection, containing ten jackets, it is more useful to give them all the index term jackets
than to create many specific categories. Anyone searching our catalogue will then be able to
search on the single term jackets and see a list of the ten items, each with a description
of exactly what kind of jacket it is, as follows:
Jackets:
1. Anorak in green cotton, England, 1985.
2. Tweed sports jacket, Hawick, Scotland.
3. Silk bolero with floral embroidery, Spanish, 1930.
If we used all the possible specific names, each of which would have only one or two items in
it, such as blazers, dinner jackets, boleros, donkey jackets, anoraks, flying jackets, sports
jackets, and so on, enquirers would have to search the catalogue under each name in turn in
order to find all the jackets in the collection, and they would never be sure that there was not a
kind of jacket that they had overlooked.
To help enquirers who approach the system by one of these terms, we therefore create the
references:
| Blazers | USE | Jackets |
| Dinner jackets | USE | Jackets |
and so on.
If we have a hundred jackets, a list under a single term will be too long to look through easily,
and we should use the more specific terms. In that case, we have to make sure that a user will know
what terms there are. We do this by writing a list of them under the general heading, thus:
| Jackets | NT | Anoraks |
| | | Blazers |
| | | Boleros |
| | | Dinner jackets |
| | | Donkey jackets |
| | | Flying jackets |
| | | Kagouls |
| | | Sports jackets |
We could just invert terms and rely on the alphabet to bring them together, in a list such as
Jackets, dinner
Jackets, donkey
Jackets, flying
Jackets, sports
but this is unreliable and subject to the vagaries of the language, which does not always
describe a specific type of item by an adjective preceding the generic name. We have to accommodate
types of jacket which have their own distinctive names such as Anoraks or Blazers.
In both the above cases, it is important that the terms which are linked are of the
same type. That is to say that any narrower term must be a specific case of the broader term, and
able to inherit its characteristics. (The developers of Object Oriented Programming have recently
discovered this idea, which has been known to the worlds of information science and biological
taxonomy for a very long time.) Thus if we say that Blazers is a narrower term of
Jackets, we mean that every blazer is, whatever else it may be, inherently a jacket, and
that it has the characteristics which define a jacket.
Mice can properly be said to be a narrower term of Rodents, because all mice
are inherently rodents, but it is not correct to list Mice as a narrower term of
Pests, because some mice, such as laboratory mice and pet mice, are not pests. The idea is
to have relationships in the thesaurus which are always true, irrespective of context. In the same
way, it would not be correct to list Buses as a narrower term of Diesel-engined
vehicles, although many of them are; if we have a diesel-engined bus in our collection, we
should show this by giving it the two terms Buses and Diesel-engined vehicles.
Broader and narrower terms
Hierarchical relationships
- Relationships must be independent of context
- Terms must represent the same type of entity
Correct:
| Mice | BT | Rodents |
| Rodents | NT | Mice |
| Shoes | BT | Footwear |
| Footwear | NT | Shoes |
Incorrect:
| Mice | BT | Pests |
| Pests | NT | Mice |
| Shoes | BT | Shoemaking |
| Shoemaking | NT | Shoes |
Good computer software should allow you to search for "Jackets and all its narrower
terms" as a single operation, so that it will not be necessary to type in all the
possibilities if you want to do a generic search.
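A generic search of this kind might be sketched as follows. The `NT` table, the `expand` and `generic_search` functions and the shape of the `index` argument are all assumptions made for illustration; the terms are taken from the examples in this paper.

```python
# Hypothetical narrower-term (NT) table, using terms from this paper.
NT = {
    "jackets": ["anoraks", "blazers", "dinner jackets"],
    "coats": ["raincoats"],
    "rainwear": ["raincoats"],
    "outerwear": ["coats", "jackets", "rainwear"],
}


def expand(term):
    """Return the term together with all of its narrower terms, at any depth."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited; a term may have several broader terms
        found.add(current)
        stack.extend(NT.get(current, []))
    return found


def generic_search(term, index):
    """Find objects indexed by the term or by any of its narrower terms.

    `index` maps an object identifier to the set of terms assigned to it.
    """
    wanted = expand(term)
    return sorted(obj for obj, terms in index.items() if terms & wanted)
```

With an index such as `{"obj1": {"anoraks"}, "obj3": {"jackets"}}`, a call to `generic_search("jackets", index)` retrieves both objects, although only one of them carries the term the enquirer typed.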
If we restrict the hierarchical relationship to true specific/generic relationships, we need
another mechanism to draw attention to other terms which an indexer and a searcher should consider.
These are RELATED TERMS of the starting term. Related terms may be of several kinds:
- Objects and the discipline in which they are studied, such as Animals and Zoology.
- Processes and their products, such as Weaving and Cloth.
- Tools and the processes in which they are used, such as Paint brushes and Painting.
It is also possible to use the RELATED TERM relationship between terms which are of
the same kind, not hierarchically related, but where someone looking for one ought also to consider
searching under the other, e.g. Beds RT Bedding; Quilts
RT Feathers; Floors RT Floor coverings.
A thesaurus is not a dictionary, and it does not normally contain authoritative
definitions of the terms which it lists. It could perfectly well do this, but a lot more work would
be required to develop it in this way. In an automated system, however, the thesaurus would be a
logical place to record information which is common to all objects to which a term might be
applied, for example notes on the history and origin of Anoraks or the identifying characteristics
and lifestyle of Mice (or perhaps Mus musculus in a taxonomic thesaurus).
Where there is any doubt about the meaning of a term, or the types of objects which it is to
represent, a SCOPE NOTE (SN) is attached to it. For example,
| Fruit | SN | distinguish from Fruits as an anatomical term |
| | BT | Foods |
| Preserves | SN | includes jams |
| Neonates | SN | covers children up to the age of about 4 weeks; includes premature infants |
A list based on these relationships can be arranged in various ways; alphabetical and hierarchical
sequences are usually required, and thesaurus software is generally designed to give both forms of
output from a single input. A typical simple thesaurus of a few clothing terms is shown in Tables 1
and 2.
Table 1: Sample thesaurus - hierarchical sequence
knitwear
> cardigans
> pullovers
outerwear
> blouses
> cardigans
> coats
> > raincoats
> dresses
> jackets
> > anoraks
> > blazers
> > dinner jackets
> > donkey jackets
> > reefer jackets
> leggings
> pullovers
> rainwear
> > raincoats
> shawls
> shirts
> skirts
> suits
> trousers
> > jeans
> > shorts
> > slacks
Table 2: Sample thesaurus - alphabetical sequence
| anoraks | BT | jackets |
| blazers | BT | jackets |
| blouses | UF | smocks |
| | BT | outerwear |
| breeches | USE | trousers |
| capes | USE | coats |
| cardigans | SN | knitted jackets with front opening |
| | BT | knitwear |
| | | outerwear |
| cloaks | USE | coats |
| coats | UF | capes |
| | | cloaks |
| | | overcoats |
| | BT | outerwear |
| | NT | raincoats |
| dinner jackets | BT | jackets |
| donkey jackets | BT | jackets |
| dresses | UF | frocks |
| | BT | outerwear |
| duffel jackets | USE | reefer jackets |
| frocks | USE | dresses |
| jackets | BT | outerwear |
| | NT | anoraks |
| | | blazers |
| | | dinner jackets |
| | | donkey jackets |
| | | reefer jackets |
| jeans | BT | trousers |
| jumpers | USE | pullovers |
| knitwear | NT | cardigans |
| | | pullovers |
| leggings | BT | outerwear |
| outerwear | NT | blouses |
| | | cardigans |
| | | coats |
| | | dresses |
| | | jackets |
| | | leggings |
| | | pullovers |
| | | rainwear |
| | | shawls |
| | | shirts |
| | | skirts |
| | | suits |
| | | trousers |
| overcoats | USE | coats |
| pullovers | UF | jumpers |
| | | sweaters |
| | BT | knitwear |
| | | outerwear |
| raincoats | BT | coats |
| | | rainwear |
| rainwear | BT | outerwear |
| | NT | raincoats |
| reefer jackets | UF | duffel jackets |
| | BT | jackets |
| shawls | UF | wraps (clothing) |
| | BT | outerwear |
| shirts | BT | outerwear |
| shorts | BT | trousers |
| skirts | BT | outerwear |
| slacks | BT | trousers |
| smocks | USE | blouses |
| suits | BT | outerwear |
| sweaters | USE | pullovers |
| trousers | UF | breeches |
| | BT | outerwear |
| | NT | jeans |
| | | shorts |
| | | slacks |
| wraps (clothing) | USE | shawls |
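The idea of producing both sequences from a single input can be illustrated with a short Python sketch. It assumes, purely for illustration, that the thesaurus is held as a table of broader-term records in which every term, including the top terms, has an entry; the names `BT`, `narrower_terms`, `alphabetical` and `hierarchical` are hypothetical, and only a small part of the sample thesaurus is included.

```python
# A small part of the sample thesaurus, held as term -> broader terms (BT).
# Assumption: every term, including top terms, appears as a key.
BT = {
    "cardigans": ["knitwear", "outerwear"],
    "pullovers": ["knitwear", "outerwear"],
    "coats": ["outerwear"],
    "raincoats": ["coats", "rainwear"],
    "rainwear": ["outerwear"],
    "knitwear": [],
    "outerwear": [],
}


def narrower_terms(bt):
    """Derive the reciprocal NT lists from the BT records."""
    nt = {term: [] for term in bt}
    for term, broader in bt.items():
        for parent in broader:
            nt[parent].append(term)
    return {term: sorted(children) for term, children in nt.items()}


def alphabetical(bt):
    """List each term in alphabetical order, followed by its BT and NT lines."""
    nt = narrower_terms(bt)
    lines = []
    for term in sorted(bt):
        lines.append(term)
        lines += [f"  BT {parent}" for parent in sorted(bt[term])]
        lines += [f"  NT {child}" for child in nt[term]]
    return lines


def hierarchical(bt):
    """Build an indented tree, starting from the terms with no broader term."""
    nt = narrower_terms(bt)
    lines = []

    def walk(term, depth):
        lines.append("> " * depth + term)
        for child in nt[term]:
            walk(child, depth + 1)

    for top in sorted(term for term, broader in bt.items() if not broader):
        walk(top, 0)
    return lines
```

Because raincoats has two broader terms, `hierarchical(BT)` lists it twice, once under coats and once under rainwear, in the same way as Table 1.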
Many thesauri have been created with the intention of being used to index documentary material, and
thus they include many terms which relate to abstract concepts, disciplines and areas of
discussion, as well as the names of concrete objects which are of primary interest to museums. We
have to be careful to be consistent in how we use these terms. The most straightforward way is to
concentrate first on what objects actually are - spades are Spades and should be given
this term, rather than the area in which they are used, whether it is gardening or gravedigging.
You may well wish to allocate abstract and discipline terms to objects too, so that you can
retrieve all the objects to do with Dentistry, Laundry, Warfare or Food
preparation. These terms can also be included in the thesaurus, so long as they are not given
hierarchical relationships to names of objects. They should be given RT
relationships to an appropriate level of object terms.
Some thesauri, such as ROOT, interfile terms of different types in their hierarchical display.
Indentation in such cases does not necessarily indicate a BT/NT relationship. The
relationships are shown in ROOT's alphabetical sequence, and it is unfortunate that they are not
distinguished in the hierarchical one.
Because these abstract terms do not describe what the object is, they could be put
into a field in the catalogue record labelled concept or subject, distinct from
the field containing terms which name the object. I do not think that such a distinction
will generally be helpful to users, however, and there seems to be no disadvantage in putting both
types of term into a single field so that they can easily be searched as alternatives or in
combination. Such a field would not be correctly called name and I therefore prefer to
call it simply indexing terms or subject indexing terms.
There has been much discussion on whether thesaurus terms should be expressed in the singular or
the plural. I believe that the difficulty arises from different views of what is being done when a
term is assigned to an object record. If a cataloguer thinks that (s)he is naming the object in
hand, (s)he will naturally use the singular: "This is a clock". If (s)he is assigning the object to
a category of similar objects, the thought will be "This belongs in the category of clocks". An
enquirer will normally ask for a category, so the latter form will be more natural and logical.
The point is not a trivial one, because as discussed in section 2 above
there is a conceptual difference between naming or describing an object and grouping it with others
so that it can be found. Both are essential steps, but an information retrieval thesaurus is
primarily concerned with grouping.
Singular or plural terms?
The cataloguer thinks: "This is a clock".
The enquirer asks: "What clocks do you have?"
Prefer plural terms because:
- We should design the catalogue to fit the way the user thinks.
- Clocks is the name of a category, including many types, so plural is more logical.
The British Standard for thesaurus construction recommends
that plural terms should be used, except for a few well-defined cases, and my view is that this
practice should be followed. Unfortunately, there are many records in museum collections which have
been given singular "object names", and the work of changing these to plurals in a move to a
thesaurus structure may be so great as to require some compromise.
The British Standard recommends that when indexing parts or components, separate terms should be
assigned for the component and for the object of which it forms part, so that aircraft engines
would be indexed by the two terms Aircraft and Engines. This causes problems in a
museum collection, however, because items indexed in this way would be retrieved in a search for
Aircraft, when only whole aircraft were being sought. It therefore seems preferable to use
a term such as Aircraft components. A particular engine may well be an aircraft component,
but it is not an aircraft. Similarly a timer from a cooker can be indexed by the terms Timers
and Cooker components, and a handle broken from a vase might be indexed as
Handles and Vase fragments. There needs to be local agreement on how this
approach is to be applied to a particular collection.
In the thesaurus, BT/NT relationships can be used for parts and wholes in only
four special cases: parts of the body, places, disciplines and hierarchical social structures.
As shown in the sample thesaurus above, a term can have several broader terms, if it belongs to
several broader categories. The thesaurus is then said to be polyhierarchical. Cardigans,
for example, are simultaneously Knitwear and Jackets, and should be retrieved
whenever either of these categories is being searched for.
The Art and Architecture Thesaurus (AAT), regrettably, allows a term
to have only one broader term, and I think that this is a serious drawback to its usefulness as an
information retrieval tool. It would take more space to repeat full hierarchies under each of
several broader terms in a printed version, but this can be overcome by using references, as
ROOT does. There is no difficulty in displaying polyhierarchies in
a computerised version of a thesaurus.
A thesaurus is an essential tool which must be at hand when indexing a collection of objects,
whether by writing catalogue cards by hand or by entering details directly into a computer. The
general principles to be followed are:
- Consider whether a searcher will be able to retrieve the item by a combination of the terms you
allocate.
- Use as many terms as are needed to provide required access points.
- If you allocate a specific term, do not also allocate that term's broader terms.
- Make sure that you include terms to express what the object is, irrespective of what it might
have been used for.
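The third principle above lends itself to an automatic check. The following Python sketch is one possible way of doing it, not a description of any existing system; the `BT` table and the functions `ancestors` and `drop_redundant` are hypothetical.

```python
# Hypothetical broader-term (BT) table, using terms from the sample thesaurus.
BT = {
    "blazers": ["jackets"],
    "jackets": ["outerwear"],
    "raincoats": ["coats", "rainwear"],
    "coats": ["outerwear"],
    "rainwear": ["outerwear"],
}


def ancestors(term):
    """Return all broader terms of a term, at any level."""
    result = set()
    stack = list(BT.get(term, []))
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(BT.get(current, []))
    return result


def drop_redundant(assigned):
    """Remove any assigned term that is a broader term of another assigned term."""
    implied = set()
    for term in assigned:
        implied |= ancestors(term)
    return sorted(set(assigned) - implied)
```

A record indexed as blazers, jackets and outerwear would be reduced to blazers alone; a generic search on jackets still retrieves it, provided that the search software expands a term to its narrower terms.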
If you have a computerised thesaurus, with good software, this can give you a lot of direct help.
Ideally it should provide pop-up windows displaying thesaurus terms which the cataloguer can choose
from and then "paste" directly into the catalogue record without re-typing. It should be possible
to browse around the thesaurus, following its chain of relationships or displaying tree structures,
without having to exit the current catalogue record, and non-preferred terms should automatically
be replaced by their preferred equivalents. A cataloguer should be able to "force" new terms onto
the thesaurus, flagged for review later by the thesaurus editor. When editing thesaurus
relationships, reciprocals should be maintained automatically, and it should not be possible to
create inconsistent structures.
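As a rough sketch of what maintaining reciprocals and refusing inconsistent structures could mean in practice, the hypothetical `Thesaurus` class below records the NT reciprocal whenever a BT relationship is added, and rejects a relationship which would make a term its own broader term. The class and method names are assumptions for illustration only.

```python
class Thesaurus:
    """Minimal illustration: BT/NT relationships with automatic reciprocals."""

    def __init__(self):
        self.bt = {}  # term -> set of broader terms
        self.nt = {}  # term -> set of narrower terms

    def _ancestors(self, term):
        """Return all broader terms of a term, at any level."""
        seen = set()
        stack = list(self.bt.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.bt.get(current, ()))
        return seen

    def add_broader(self, term, broader):
        """Record 'term BT broader' and its reciprocal 'broader NT term'."""
        # Reject a loop: the proposed broader term must not already lie below the term.
        if term == broader or term in self._ancestors(broader):
            raise ValueError(f"{term} BT {broader} would create a loop")
        self.bt.setdefault(term, set()).add(broader)
        self.nt.setdefault(broader, set()).add(term)
```

After `add_broader("raincoats", "coats")` and `add_broader("coats", "outerwear")`, an attempt to add `outerwear BT raincoats` raises `ValueError` instead of corrupting the hierarchy.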
As there are many thesauri in existence already, it is worth considering seriously whether one of
these can be used before embarking on the job of creating a new one for a particular museum or
collection. So long as the general principles are followed, you should be able to expand a
thesaurus to give you more detail if you need it, or truncate some sections at a high level if they
contain more detail than your collections justify. So long as the relationships are universally
true, it should be possible to combine sections of thesauri developed by different museums and thus
avoid duplication of work.
Even when using an authoritative thesaurus, some care is needed, and I have mentioned some
limitations of ROOT and AAT in 7.1 and 7.4 above. It is still much easier to base your work on
something like these than to build your own from scratch, unless you have a very specialised
collection.
Someone has to be responsible for maintaining the thesaurus. New terms can be suggested, and temporarily "forced" into
the thesaurus by cataloguers as they catalogue objects, but someone has to review these terms
regularly and either accept them and build them into the thesaurus structure, or else decide that
they are not appropriate for use as indexing terms. In that case they should generally be retained
as non-preferred terms with USE references to the preferred terms, so that people
who seek them will not be frustrated. An encouraging thought is that once the initial work of
setting up the thesaurus has been done, the number of new terms to be assessed each week should
decrease, and many systems have operated successfully in the past with printed thesauri, which are
quite difficult to keep up to date.
A thesaurus is not a panacea which will meet all subject retrieval needs. It is particularly
appropriate for fields which have a hierarchical structure, such as names of objects, subjects,
places, materials and disciplines, and it might also be used for styles and periods. A thesaurus
proper would not normally be used for names of people and organisations, but a similar tool, called
an authority file, is usually used for these. The difference is that while an authority file has
preferred and non-preferred relationships, it does not have hierarchies.
[Authority files and thesauri are two examples of a generalised data structure which can allow
the indication of any type of relationship between two entries, and modern computer software should
allow different types of relationship to be included if needed.]
A thesaurus is an essential component for reliable information retrieval, but it can usefully be
complemented by two other types of subject retrieval mechanism.
While a thesaurus inherently contains a classification of terms in its hierarchical relationships,
it is intended for specific retrieval, and it is often useful to have another way of grouping
objects. This may relate to administrative distribution of responsibility for "collections" within
a museum, or to subdivisions of these collections into groups which depend on local emphasis. It is
also often necessary to be able to print a list of objects arranged by subject in a way which
differs from the alphabetical order of thesaurus terms. Each subject group may be expressed as a
compound phrase, and given a classification number or code to make sorting possible.
It is highly desirable to be able to search for specific words or phrases which occur in object
descriptions. These may identify individual items by unique words such as trade names which do not
occur often enough to justify inclusion in the thesaurus. A computer system may "invert" some or
all fields of the record, i.e. make all the words in them available for searching through a
free-text index, or it may be possible to scan records by reading them sequentially while looking
for particular words. The latter process is fairly slow, but is a useful way of refining a search
once an initial group has been selected by using thesaurus terms.
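The two free-text techniques can be contrasted in a small Python sketch. The `RECORDS` table reuses the jacket descriptions given earlier in this paper; the function names `words`, `invert` and `scan` are hypothetical, and a real system would need to deal with word endings, punctuation and phrases more carefully.

```python
import re

# Hypothetical catalogue records: identifier -> free-text description.
RECORDS = {
    1: "Anorak in green cotton, England, 1985.",
    2: "Tweed sports jacket, Hawick, Scotland",
    3: "Silk bolero with floral embroidery, Spanish, 1930.",
}


def words(text):
    """Split a description into lower-case words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def invert(records):
    """Build a free-text index: word -> set of record identifiers."""
    index = {}
    for ident, text in records.items():
        for word in words(text):
            index.setdefault(word, set()).add(ident)
    return index


def scan(records, candidates, word):
    """Sequentially read only the candidate records, looking for a word."""
    return sorted(i for i in candidates if word.lower() in words(records[i]))
```

The inverted index answers a query with a single look-up, while `scan` reads every candidate record in turn, which is why it is best applied to a group already narrowed down by thesaurus terms.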
This document is at http://www.willpowerinfo.co.uk/thesprin.htm
Revised 13th February 1998
Comments and feedback on content or presentation are welcome and should be sent to
Leonard Will at L.Will@willpowerinfo.co.uk
Copyright © Willpower Information, 1998.