Introduction and Antecedents
Ontologies and lexical hierarchies, which represent semantic information about lexical units (semantic classes and lexical relations), can be considered a solid strategy for Information Retrieval (query expansion, question answering systems, data mining), for Knowledge Management (document indexing), for Automatic Translation (interlingual lexical representation) and for Automatic Terminology Extraction (thematic relevance of candidates).
We think it is relevant to extend an ontology model, EuroWordNet (EWN), which has become almost a standard model owing to its wide use in computational linguistics, to specialized areas, given the interest and opportunity this offers for developing new applications in those fields. We propose to do so through automatic terminology extraction, by adapting the tool YATE (Vivaldi 2001) to new subject fields and languages, because an efficient, wide-coverage terminology extractor will also help to build and update basic terminology resources for the other fields mentioned (IR, AT, KM).
Moreover, the results of basic research on terminological units in context in previous related projects (TEXTERM2 and RICOTERM2) show that semantic information and lexical combinatorics are the most relevant sources of information for automatic extraction in various specialized fields, above all in the discourse of the humanities and social sciences, which has neither morphological nor syntactic peculiarities because it is closer to non-specialized or general discourse. In addition, adapting a tool such as YATE to a typologically different language such as Basque leads us to emphasize the semantic strategy over the extractor's other linguistic strategies. This is mainly due to the consensus on specialized knowledge that exists across languages (and to a shared legal framework in the field of Law). These elements can be used as starting hypotheses to justify the need to extend EWN to specialized fields.
Previous projects, namely La terminología científico-técnica: reconocimiento, análisis y extracción de información formal y semántica (DGES-PB-96-0293); TEXTERM. Textos especializados y terminología: selección y recuperación automática de la información (BFF-2000-0841); and TEXTERM2. Fundamentos, estrategias y herramientas para el procesamiento y extracción automáticos de información especializada (BFF2003-02111), have empirically demonstrated the adequacy of the theoretical proposal according to which units with terminological value can be described and explained as lexical units of natural language, and their specificity can be accounted for by the selection of semantic features activated when they are used in discourse. Linguistic clues to specialized uses can be found in texts. The specialized knowledge of a text can be formulated as a network of knowledge nodes (represented by terminological lexical units or by syntactic combinations in which at least one of these units appears). The clues provided by units used as specialized ones can be of different types: the use of specific morphological and lexical units, the frequency of morphological and lexical units relative to the frequency of the same units in non-specialized discourse, specific syntactic combinations, and changes in the syntactic value of some lexical units. Pragmatic conditions activate the selection of one feature or another of a lexical unit. Lexical units with terminological value are therefore the result of activating possible features contained in the lexicon. These results have been published in various articles and book chapters by the IULATERM group.
As for the applied aspect of the research, the successive developments of the tool YATE (Vivaldi 2001) result from the previous projects carried out by the groups and from various PhD theses related to those projects:
In TEXTERM (2000-2003), the tool was designed to combine morphological information (classical forms), syntactic information (structural patterns) and semantic information (labels from EuroWordNet) with statistical strategies. The first version, for Spanish and Medicine, was built (PhD thesis by J. Vivaldi 2002), together with adaptations for Medicine (Catalan) and Human Genome (Catalan and Spanish).
During TEXTERM2 (2003-2006), a first adaptation for Law in Catalan was carried out (PhD thesis by O. Domènech 2006). In RICOTERM-2 (2004-2007), the adaptation for Economics in Catalan and Spanish was built (ongoing PhD thesis by A. Joan), and a handbook on adapting YATE to new languages and specialized domains was written (Joan, Lorente, Domènech, Estopà and Vivaldi 2006, forthcoming).
The adaptation of YATE to a language and domain through the enrichment of EuroWordNet centers on the manual revision of EWN synsets, in order to identify lexical relations specific to specialized areas and thus establish what are called Domain Borders (DB) in YATE. The limitations of EWN in specialized domains imply, in many cases, adding new synsets in order to establish the corresponding DB in YATE. Constant evaluation of the tool after new DB are added makes it possible to improve the results steadily until the desired recall and precision figures are reached.
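The role of a Domain Border and of the evaluation cycle can be illustrated with a minimal sketch. It is not YATE's implementation: it assumes a toy hypernym table (the synset names, the HYPERNYM table and the gold list are all invented for illustration and are not EWN data) and treats a DB as a synset whose hyponyms are all taken to belong to the domain.

```python
# Toy hypernym links (child -> parent). Hypothetical data, not real EWN content.
HYPERNYM = {
    "hepatitis": "liver_disease",
    "liver_disease": "disease",
    "disease": "condition",
    "aspirin": "drug",
    "drug": "substance",
    "table": "furniture",
    "furniture": "artifact",
}

# Hypothetical Domain Borders for Medicine: everything below them counts as medical.
DOMAIN_BORDERS = {"disease", "drug"}


def in_domain(synset, borders=DOMAIN_BORDERS, hypernym=HYPERNYM):
    """Return True if the synset, or one of its ancestors, is a Domain Border."""
    seen = set()
    node = synset
    while node is not None and node not in seen:  # 'seen' guards against cycles
        if node in borders:
            return True
        seen.add(node)
        node = hypernym.get(node)  # None once the top of the hierarchy is reached
    return False


def precision_recall(extracted, gold):
    """Compute precision and recall of extracted candidates against a gold list."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Candidates are kept only if they fall under a Domain Border.
candidates = ["hepatitis", "aspirin", "table", "liver_disease", "condition"]
extracted = [c for c in candidates if in_domain(c)]

# Hypothetical gold standard; "insulin" is missing from the toy hierarchy.
gold = ["hepatitis", "aspirin", "liver_disease", "insulin"]
precision, recall = precision_recall(extracted, gold)
```

In this toy setting the term "insulin" is missed because it has no synset in the hierarchy. Adding an entry that places it under "drug" would bring it inside the border and raise recall, which mirrors, in a very simplified way, the cycle of adding synsets and re-evaluating described above.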
General Targets
This immediate background has given us sufficient experience (and an efficient working methodology) to undertake the following tasks over a 3-year period:
- Finishing the adaptation to Law in Catalan
- Adapting to Informatics and Environment in Catalan
- Adapting to Law, Informatics and Environment in Spanish
- Finishing textual resources of specialized domains in Basque
- Analyzing processing labels and establishing format exchange protocols, so that YATE can operate with the processed Basque textual corpus
- Establishing dictionaries of Spanish-Catalan-Basque equivalents for the different subject fields, including expressions that identify conceptual relations
- Enriching EWN with data from Basque about Economics, Medicine, Law, Informatics and Environment (for future adaptations of YATE)
- Evaluating the results of applying YATE to specialized texts in different languages
- Designing and implementing a new access platform for the YATE tool, with different complementary applications related to the detection of conceptual relations
These tasks can be summarized in the following specific targets:
- The automatic terminology extraction tool YATE, developed to cover extraction in 6 subject domains and 3 languages (a privileged position in the state of the art)
- The evaluated working methodology for the adaptation of the extractor to new fields and languages. Contrasted evaluation of the extractor results (quality control).
- The enrichment of EuroWordNet in 6 specialized thematic fields and 3 languages (impact on other projects)
- The accessibility of the resources and tools created and/or adapted.
Targets of Each Project and Coordination Devices
Subproject 1 (UPF):
- Completion of the adaptation of YATE to the Law domain in Catalan. Enlargement of EWN. Results assessment.
- New adaptations of YATE in Catalan for Informatics and Environment. Enlargement of EWN. Results assessment.
- New adaptations of YATE in Spanish for Law, Informatics and Environment. Enlargement of EWN. Results assessment.
- Contrastive analysis of processing labels used in Spanish-Catalan and in Basque (shared task).
- Establishing protocols for data exchange from the contrastive analysis of tagsets used in this project.
- Listing expressions for the detection of conceptual relations in Catalan.
- Designing and implementing a new access gateway to YATE, with complementary applications for the detection of conceptual relations.
Subproject 2 (EHU):
- Completion of specific-field textual resources for Basque: textual corpora of reference in Medicine, Law, Informatics and Environment through collaboration agreements with official bodies and companies.
- Linguistic processing of a significant sample of these corpora (lemmatization, morphological analysis and disambiguation).
- Contrastive analysis of processing labels used in Catalan-Spanish and in Basque (shared task).
- Establishing Basque equivalents for synsets incorporated in Spanish and Catalan in EWN (multilingual dictionary).
- Enrichment of EWN with data from Basque in the fields of Economics, Medicine, Law, Informatics and Environment (from the previous dictionary).
- Establishing equivalents in Basque and Spanish from the list of expressions of conceptual relations in Catalan.
- Designing and developing the project webpage: a portal of the resources and tools used in the project.
Coordination Devices:
- Signing an agreement for the bilateral transfer of resources with research objectives.
- Joint meetings of the RS (Research Staff) of each subproject for planning tasks, exchanging resources, following up the working schedule and disseminating results (access gateway, project webpage, final publication, conference attendance planning). Contact with other research groups in the field.
- Joint meetings of the technical staff of both subprojects to discuss the comparison of tagsets and the exchange formats.
- Joint meetings of the linguists of both subprojects to resolve together any issues arising in the enlargement of EWN and in its building process.