Introduction and Background

Ontologies and lexical hierarchies, which represent semantic information about lexical units (semantic classes and lexical relations), can be considered a solid strategy for Information Retrieval (query expansion, question answering systems, data mining), for Knowledge Management (document indexing), for Automatic Translation (interlingual lexical representation) and for Automatic Terminology Extraction (thematic relevance of candidates).

We consider it relevant to extend an ontology model, EuroWordNet (which has become almost the standard model owing to its wide use in computational linguistics), to specialized areas, given the interest and opportunity this offers for developing new applications in those fields. We propose to do so through automatic terminology extraction, by adapting the tool YATE (Vivaldi 2001) to new subject areas and languages, because an efficient, wide-coverage terminology extractor will also help to build and update basic terminology resources for the other fields mentioned (IR, AT, KM).

Moreover, the results of basic research on terminological units in context in previous related projects (TEXTERM2 and RICOTERM-2) show that semantic information and lexical combinatorics are the most relevant features for automatic extraction in various specialized fields, above all in the discourse of the humanities and social sciences, because this discourse has neither morphological nor syntactic singularities and is closer to non-specialized or general discourse. In addition, adapting a tool such as YATE to a typologically different language such as Basque leads us to emphasize the semantic strategy over the other linguistic strategies in the extractor. This is mainly due to the consensus on specialized knowledge that exists across languages (and, in the field of Law, to a shared legal framework). These elements can be used as starting hypotheses to justify the need to extend EWN to specialized fields.

Previous projects, namely La terminología científico-técnica: reconocimiento, análisis y extracción de información formal y semántica (DGES-PB-96-0293); TEXTERM. Textos especializados y terminología: selección y recuperación automática de la información (BFF-2000-0841); and TEXTERM2. Fundamentos, estrategias y herramientas para el procesamiento y extracción automáticos de información especializada (BFF2003-02111), have empirically demonstrated the adequacy of the theoretical proposal according to which units with terminological value can be described and explained as lexical units of natural language, and their specificity can be accounted for by the selection of semantic features activated when they are used in discourse. Specialized uses leave linguistic clues in texts. The specialized knowledge of a text can be formulated as a net of knowledge nodes (represented by terminological lexical units or by syntactic combinations in which at least one of these units appears). The clues that a unit is being used as a specialized one can be of different types: the use of specific morphological and lexical units, the frequency of morphological and lexical units relative to the frequency of the same units in non-specialized discourse, specific syntactic combinations, and changes in the syntactic value of some lexical units. Pragmatic conditions are what activate the selection of some features of a lexical unit rather than others. Lexical units with terminological value are therefore the result of activating features that are potentially contained in the lexicon. These results have been published in various articles and book chapters by the IULATERM group.

As for the applied side of the research, the successive versions of the tool YATE (Vivaldi 2001) are the result of the groups' previous projects and of several PhD theses related to those projects:

In TEXTERM (2000-2003), the tool was designed to combine morphological information (classical forms), syntactic information (structural patterns) and semantic information (labels from EuroWordNet) with statistical strategies. The first version, for Spanish and Medicine (PhD thesis by J. Vivaldi 2002), was built, together with adaptations for Medicine (Catalan) and Human Genome (Catalan and Spanish).

During TEXTERM2 (2003-2006), a first adaptation for Law in Catalan was completed (PhD thesis by O. Domènech 2006). In RICOTERM-2 (2004-2007), the adaptation for Economics in Catalan and Spanish was built (ongoing PhD thesis by A. Joan), and a handbook for adapting YATE to a new language and specialized domain was written (Joan, Lorente, Domènech, Estopà and Vivaldi 2006, forthcoming).

The adaptation of YATE to a language and domain through the enrichment of EuroWordNet centers on the manual revision of EWN synsets, in order to identify the lexical relations specific to specialized areas and so establish what YATE calls Domain Borders (DB). The limitations of EWN in specialized domains mean that, in many cases, new synsets must be added before the corresponding DB can be established in YATE. Constant evaluation of the tool after each new DB is added makes it possible to improve the results steadily until the desired recall and precision figures are reached.
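
The cycle of adding Domain Borders and re-evaluating the extractor can be illustrated with a minimal sketch. It is not YATE's implementation: it assumes that a Domain Border marks a synset whose hyponyms are all taken to belong to the domain, and the hierarchy, the synset labels, the candidate list and the reference term list are all hypothetical toy data rather than real EWN content.

```python
# Toy hypernym hierarchy: child synset -> parent synset.
# The labels are hypothetical and are not real EWN identifiers.
HYPERNYM = {
    "contract": "legal_document",
    "lease": "contract",
    "legal_document": "document",
    "document": "entity",
    "court": "institution",
    "institution": "entity",
    "table": "furniture",
    "furniture": "entity",
}

# Hypothetical Domain Borders for Law (assumption: a synset at or below
# a border is treated as belonging to the domain).
DOMAIN_BORDERS = {"legal_document", "court"}


def in_domain(synset, borders=DOMAIN_BORDERS, hypernym=HYPERNYM):
    """Walk up the hypernym chain and report whether a border is reached."""
    seen = set()
    while synset is not None and synset not in seen:
        if synset in borders:
            return True
        seen.add(synset)
        synset = hypernym.get(synset)
    return False


def precision_recall(extracted, reference):
    """Compare the extracted candidates with a hand-validated term list."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


candidates = ["lease", "court", "table", "document"]
extracted = [c for c in candidates if in_domain(c)]

# Hypothetical reference list of valid Law terms for this toy text.
reference = {"lease", "court", "contract"}
precision, recall = precision_recall(extracted, reference)
print(extracted, precision, recall)
```

In this toy setting, adding or moving a border changes which candidates are accepted, and recomputing precision and recall after each change mirrors the evaluation loop described above.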

General Targets

This immediate background has given us sufficient experience (and an efficient working methodology) to approach the following tasks over a three-year period:

These tasks can be summarized in the following specific targets:

Targets of Each Project and Coordination Devices

Subproject 1 (UPF):

Subproject 2 (EHU):

Coordination Devices:

Last update: 12-12-2007