Building subject headings for ULSED: part of the Unified Language System for Engineering Design

Abstract

Efforts in information retrieval and exchange across multiple information systems have long been hindered by the fact that different systems use different terminologies. The National Library of Medicine (NLM) recognized this problem and started the Unified Medical Language System (UMLS) project, which aims to provide a common platform for exchanging information between different medical information sources and to assist medical professionals and researchers in retrieving electronic biomedical information. In engineering design, the same problem arises when one tries to gather information from multiple resources, e.g., from the design document repository of a design firm, the research literature on engineering design, or classroom materials for an engineering design course. We believe that a system similar to UMLS can be of great help in integrating the information from all these different sources for engineering design. We call it the Unified Language System for Engineering Design (ULSED).

As in UMLS, we first need to build a meta-thesaurus, or a standard set of subject headings, for engineering design. Instead of the common approach of building the subject heading set manually, we propose a procedure that suggests the vocabulary automatically. One possible approach is to apply key phrase extraction techniques to electronic engineering design documents.

Key phrases play an important role in information retrieval systems. They are heavily used in document indexing, user query expansion, and document summarization. Typically, authors are responsible for assigning key phrases to their documents. However, most authors will not assign key phrases unless explicitly asked to. Even when they do, the key phrases may be too general or too specific for information retrieval purposes. To remedy this problem, an automatic key phrase extraction technique is proposed and tested in this research. The key phrase extraction task is formulated as a multi-objective optimization problem and solved with a genetic algorithm. Bookstein and Picard have both found that key phrases exhibit clumping behavior, meaning that they tend to cluster together. They proposed several statistical measures that can be used to judge whether a given phrase can be considered a key phrase. We take their measures as one of our optimization objectives, i.e., to maximize the total statistical significance of a set of phrases. At the same time, we would like to select as few key phrases as possible (while maintaining the integrity of the phrase set); thus the other objective is to minimize the number of phrases selected. Preliminary results show that the genetic algorithm has some success in selecting content-bearing key phrases from a set of pre-processed candidate phrases, but it is not yet discriminating enough to be used without human assistance. We are currently evaluating the results further and adjusting the genetic algorithm parameters in order to achieve better results.
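The two-objective formulation described above can be illustrated with a minimal sketch. It is not the implementation used in this research: the abstract does not state how the two objectives are combined, so the sketch assumes a simple Pareto-based genetic algorithm in which each individual is a bit mask over the candidate phrases. The per-phrase significance scores, the function names, and all parameter values are hypothetical, and the constraint on the integrity of the phrase set is not modeled.

```python
import random

def evaluate(mask, scores):
    """Return (total significance, number of phrases selected) for a bit mask."""
    total = sum(s for bit, s in zip(mask, scores) if bit)
    return total, sum(mask)

def dominates(a, b):
    """True if objective pair a is no worse than b on both objectives and differs.
    The first objective (significance) is maximized, the second (count) minimized."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def pareto_front(population, scores):
    """Return the distinct non-dominated masks in a population."""
    evaluated = [(m, evaluate(m, scores)) for m in set(population)]
    return [m for m, f in evaluated
            if not any(dominates(g, f) for _, g in evaluated)]

def select_phrases(scores, pop_size=40, generations=60, mutation_rate=0.05, seed=0):
    """Evolve phrase subsets and return the final non-dominated masks.
    Assumes at least two candidate phrases (needed for one-point crossover)."""
    rng = random.Random(seed)
    n = len(scores)
    population = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        front = pareto_front(population, scores)  # elitism: the front always survives
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.choice(front), rng.choice(front)
            cut = rng.randrange(1, n)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = front + children
    return pareto_front(population, scores)

# Hypothetical candidate phrases with made-up significance scores.
candidates = {"finite element": 4.2, "design process": 3.1, "this paper": 0.2,
              "load path": 2.7, "in order": 0.1}
phrases = list(candidates)
for mask in sorted(select_phrases(list(candidates.values())), key=sum):
    print([p for p, bit in zip(phrases, mask) if bit], evaluate(mask, candidates.values()))
```

Each mask in the returned list is one trade-off between total significance and the number of phrases selected. A human reviewer could then pick a point on this front, which is consistent with the observation above that the results still need human assistance.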