Networked Knowledge Organization Systems and Services

The 5th European Networked Knowledge Organization Systems (NKOS) Workshop

Workshop at the 10th ECDL Conference, Alicante, Spain

September 21, 2006

Abstracts of presentations

Lauser, B., Sini, M., Liang, A., Keizer, J. and Katz, S.
From AGROVOC to the Agricultural Ontology Service / Concept Server. An OWL model for creating ontologies in the agricultural domain.

This paper illustrates the conversion from a traditional thesaurus in agriculture (AGROVOC) to a new system, the Agricultural Ontology Service Concept Server (AOS/CS). The Concept Server will serve as a multilingual repository of concepts in the agricultural domain providing ontological relationships and a rich, semantically sound terminology. The Food and Agriculture Organization recently developed the underlying model for this new system in the Web ontology language OWL. In this paper, we describe the purpose of this conversion and the use of OWL and highlight in particular the core features of the developed OWL model. We go on to explain how it evolves and differs from the traditional thesaurus approach.

Keywords

Ontologies, Thesauri, Semantic Web, OWL, Classification Schemes, Metadata, AGROVOC, AOS

Nicholson, D. and McCulloch, E.
HILT Phase III: Design requirements of an SRW-compliant Terminologies Mapping Pilot.

As Zeng and Chan (2004) note, interoperability of knowledge organisation systems (KOS) is a key issue in today's networked environment. It is an issue likely to impact, in time, on the semantic web vision (Berners-Lee et al, 2001), but is more usually tackled at present in an information retrieval context. Information services employ a plethora of different subject schemes to describe their resources. In some cases, they use recognised standards, in others 'in-house' or even uncontrolled schemes. Either way, the practice acts as a barrier to effective cross-searching by subject over distributed information services. The issue has attracted a good deal of interest in recent years. Potential solutions proposed include linking or switching between schemes, mapping, derivation/modelling (see for example Doerr, 2001; Chan and Zeng, 2002), and automatic or semi-automatic classification (see for example Koch and Vizine-Goetz, 1998; Godby et al, 1999; Ardo, 2004). CARMEN (2000), LIMBER (2000), Renardus (2002), and MACS (2005) are amongst a range of recent projects that have tackled the problem, and key international players such as OCLC (http://www.oclc.org/) have also done relevant work (see http://www.oclc.org/productworks/terminologiespilot.htm)... [Full PDF version of summary]

Price, S., Lykke Nielsen, M. and Delcambre, L.
The Feasibility of Using the Semantic Components Model for Indexing Documents in Digital Libraries.

Finding one or more documents that exactly answer a targeted information need often fails, especially in digital libraries that lack the hyperlink structure that is so successfully exploited by the PageRank algorithm. In the absence of extensive hyperlinks, successfully matching document requests to document content is essential. Assignment of keywords using human intellectual processes is expensive and prone to inconsistency. Automated full-text indexing is less expensive but requires the searcher to anticipate the language used in relevant documents. We have developed a new model, which we call Semantic Components, that leverages expert knowledge about how information is organized and expressed within the domain and is intended to facilitate precise searching in domain-specific libraries. We hypothesize that semantic component indexing will yield improved search results over automated full-text indexing and that indexing using this model will be faster (and therefore cheaper) and more consistent than human keyword assignment. [Full PDF version of abstract]

Miles, A.
A Theory of Retrieval Using Structured Vocabularies (SKOS, Preparing for Standardization).

This presentation gives an overview of the Theory of Retrieval Using Structured Vocabularies that forms the main body of Alistair Miles' recently published dissertation on Retrieval and the Semantic Web.

The motivation for this research is the well-known fact that building a controlled vocabulary is expensive. Once deployed, without continuous maintenance a controlled vocabulary can become irrelevant in a short time, but maintaining a controlled vocabulary is also expensive. If a vocabulary is maintained, then other systems that depend on the vocabulary must also be maintained, and that can be expensive. Finally, if more than one vocabulary must be used within a retrieval system, then vocabularies must be mapped to each other, and mapping is expensive.

Controlled vocabularies can enable uniquely functional information retrieval solutions, but if they are to prove genuinely worthwhile then theoretical and practical knowledge of how the best possible retrieval functionalities and performance can be obtained at the lowest possible cost is needed.

The Theory presented here approaches this need by first exploring fundamental working assumptions in the application of controlled vocabularies to information retrieval problems. By making these assumptions explicit, and by following these assumptions through to their operational and logical consequences, a theoretical foundation is established from which rigorous empirical validation and refinement of these assumptions may be sought. Empirical validation of a coherent theory provides the bridge between the ideal and the real worlds, between design and effective implementation.

In general terms, the Theory aims to provide a coherent account of the process by which a controlled vocabulary is used to index and then to retrieve objects, typically "documents". The emphasis is on how the structure of a vocabulary can be exploited to improve recall without compromising precision. The Theory is also concerned with providing an account of how predictable results may be achieved in situations where more than one controlled vocabulary is deployed. In this case the focus is on demanding the minimum mapping information required to achieve automatic translations of either queries or indexes with predictable results.
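One common way of exploiting vocabulary structure at retrieval time is to expand a query concept with its narrower concepts before matching it against the index. The sketch below illustrates that general idea only; it is not taken from the Theory itself, and the names `narrower_closure`, `retrieve`, the sample hierarchy and the sample index are all hypothetical.

```python
from typing import Dict, Iterable, Set


def narrower_closure(concept: str, narrower: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the concept together with all of its transitive narrower concepts."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the hierarchy
        seen.add(current)
        stack.extend(narrower.get(current, ()))
    return seen


def retrieve(query: str,
             narrower: Dict[str, Iterable[str]],
             index: Dict[str, Set[str]]) -> Set[str]:
    """Return documents indexed with the query concept or any narrower concept."""
    expanded = narrower_closure(query, narrower)
    return {doc for doc, concepts in index.items() if concepts & expanded}


# Hypothetical hierarchy (broader -> narrower) and document index.
narrower = {"mammals": ["dogs", "cats"], "dogs": ["terriers"]}
index = {
    "d1": {"terriers"},
    "d2": {"cats"},
    "d3": {"birds"},
    "d4": {"mammals"},
}

print(sorted(retrieve("mammals", narrower, index)))
```

Expanding only downwards (to narrower concepts) is one design choice: a query for "mammals" gains documents indexed under "terriers", while a query for "dogs" does not pick up documents indexed only under the broader "mammals".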

The Theory also provides a logical framework within which the functionality of different retrieval applications may be directly compared. This is particularly useful where differences in user interface design can obscure underlying similarities in functionality and implementation. This is a vital precursor to determining the requirements of machine-understandable languages for both vocabulary and index data. The Simple Knowledge Organisation Systems (SKOS) working drafts are to be developed towards W3C Recommendation status within the scope of the Semantic Web Deployment Working Group (SWDWG), and the initial task of the working group will be to specify formal requirements for SKOS. It is hoped that the Theory presented here will enable a comparative analysis of a set of use cases, which can then be translated into concrete and satisfiable requirements.

Kruk, R.S., Haslhofer, B., Piotrowski, Westerski, A. and Woroniecki, T.
The Role of Ontologies in Semantic Digital Libraries. [Full PDF version of abstract]

Voss, J.
Knowledge Organization with Wikipedia: Joining the Free Encyclopaedia and Digital Libraries.

Extended Abstract

Five years after it was founded almost unintentionally, the free-content encyclopaedia Wikipedia is still growing with astonishing success. Many thousands of volunteers have created 4 million articles in more than 100 languages, and wikipedia.org is one of the 20 most visited websites worldwide, an impact libraries can only dream of (according to Alexa (2006), archive.org is ranked around 120 and loc.gov around 1,200, to mention only the most popular library sites). While libraries work with elaborate rules and experts, there is no dedicated management in Wikipedia. Everyone can directly edit almost any article, and standards emerge only by means of self-organization. This makes Wikipedia less reliable, while libraries apparently provide objective and accurate information. Nevertheless, many online searchers prefer Wikipedia as their first reference. But the differences are not that wide: libraries and Wikipedia both aim to collect and arrange knowledge and try to make it accessible to everyone with information needs. Despite their different methods they share common goals, so cooperation between libraries and Wikipedia makes sense.

This presentation shows several strategies for connecting Wikipedia and digital libraries. In making Wikipedia part of a digital library (or the other way round), knowledge organisation systems play an especially important role. Since 2005 the German Wikipedia and the German National Library have cooperated on the use of the Personennamendatei (PND) name authority file (Hengel and Pfeifer, 2005). Around 100,000 biographic articles in the German Wikipedia are equipped with metadata about persons, which contain a PND number in around 20,000 cases. This number generates a link through which Wikipedia users can navigate directly to library catalogues to find publications by or about a specific person. A reciprocal link to Wikipedia articles is planned, and the method could also be extended to other authority files and maybe even subject headings (Voss, 2005). Subject indexing in Wikipedia is handled with so-called categories. In fact this system of tagging Wikipedia articles is the first collaborative tagging system with multiple hierarchical relationships: a collaboratively created thesaurus (Voss, 2006). Based on these categories, mappings between Wikipedia and other information systems could also be established. Methods of thesaurus and ontology matching will help to produce concordances if legal restrictions are resolved. First experiments in Wikipedia show that indexing Wikipedia articles with a foreign classification is not suitable, but German Wikipedia's categories in the field of library and information science could successfully be mapped to the JITA Classification System of Library and Information Science.

Besides categories, Wikipedia articles themselves can be used directly to index other resources. Wikipedia contains many articles about complex concepts but also articles about explicit entities such as people, organisations and places. Each article is identified by a unique name, so Wikipedia can also be seen as a controlled vocabulary. Homonyms are handled with disambiguation pages (http://en.wikipedia.org/wiki/Wikipedia:Disambiguation) that list all meanings of a word with links to the corresponding articles, and synonyms are joined with redirects (http://en.wikipedia.org/wiki/Wikipedia:Redirect) which link to preferred terms. Wikipedia is also the first strictly hypertextual encyclopaedia. Methods of network analysis and data mining will provide networks of concepts that can be used for browsing and mapping knowledge. An extension of MediaWiki (the software Wikipedia runs on) adds typed links and supports RDF (Völkel et al, 2006); this makes it possible to create semantic networks with a wiki and may integrate Wikipedia into the promised Semantic Web. Besides normal hyperlinks between Wikipedia articles there are specific links to other databases that can be used for integrated services. These special links mostly contain a unique identifier per article. Examples are ISBN and ISSN numbers, laws, patent numbers, digital object identifiers and links to the Internet Movie Database (IMDb). A third type of link connects Wikipedias in different languages (different language versions of Wikipedia are mostly independent and have different highlights and specialities).
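Treating Wikipedia as a controlled vocabulary, with redirects acting as synonym rings and disambiguation pages resolving homonyms, can be pictured with a small in-memory model. The following is a minimal sketch under stated assumptions: the `Vocabulary` class, its `resolve` method and the sample data are hypothetical and do not reflect the MediaWiki API or Wikipedia's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Vocabulary:
    """Toy model of article names used as a controlled vocabulary."""
    articles: Set[str] = field(default_factory=set)                        # preferred terms
    redirects: Dict[str, str] = field(default_factory=dict)                # synonym -> target
    disambiguations: Dict[str, List[str]] = field(default_factory=dict)    # homonym -> meanings

    def resolve(self, term: str) -> List[str]:
        """Map an entry term to zero, one, or several preferred terms."""
        seen: Set[str] = set()
        # Follow redirect chains, stopping if a loop is detected.
        while term in self.redirects and term not in seen:
            seen.add(term)
            term = self.redirects[term]
        if term in self.disambiguations:
            return list(self.disambiguations[term])   # ambiguous: all candidate meanings
        if term in self.articles:
            return [term]                             # unambiguous preferred term
        return []                                     # not in the vocabulary


# Hypothetical sample data.
vocab = Vocabulary(
    articles={"Automobile", "Mercury (planet)", "Mercury (element)"},
    redirects={"Car": "Automobile"},
    disambiguations={"Mercury": ["Mercury (planet)", "Mercury (element)"]},
)

print(vocab.resolve("Car"))
print(vocab.resolve("Mercury"))
```

An indexing tool built on such a model would store only the resolved article names, and would ask the indexer to choose whenever `resolve` returns more than one candidate.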

Wikipedia provides a vast number of possibilities for connecting its knowledge structure with other systems, especially digital libraries. However, the wiki paradigm, with no firm rules and directions, may be unfamiliar. Its self-organization allows flexible and quick solutions; virtually everything can be changed at any time, but if no one is willing to work on a specific task voluntarily then it will not be done. Also essential to Wikipedia is its restriction to free content. All textual content is licensed under the GNU Free Documentation License (GFDL), which allows anyone to use, modify and republish the content as long as authors are named and derivative works are published under the same license. Keeping this in mind, Wikipedia content can be used in portals, catalogue enrichment and other contexts. Connections with other databases facilitate browsing structures over multiple information systems. The various prospects for collaboration have not yet even been surveyed.

References

Hengel, Christel and Pfeifer, Barbara (2005): "Kooperation der Personennamendatei (PND) mit Wikipedia". In: Dialog mit Bibliotheken, volume 17, number 3, pages 18-24.

Falquet, G., Nerima, L., Mottaz Jiang,C-L. and Ziswiler, J-C.
Integrating Semi-formal Knowledge Organization Structures.

ABSTRACT

Over the last few years, we have been working on hyperbook structures to build digital libraries. A hyperbook is made of a domain ontology containing the most important concepts of the field or subject in question and of information fragments linked to the ontology's concepts. Fragments are text chunks that serve primarily to define a concept, but they can also describe different aspects of the concept or contain examples, references, etc. Optionally, links between fragments and concepts can be typed. The digital library is built by aligning the different hyperbook ontologies, which identifies equivalent and similar concepts. The aim is to create an extended view of each hyperbook in the form of a virtual document that provides readers with supplementary information found in the other hyperbooks, like additional examples, term definitions, more detailed or more general information, etc.

Much in the spirit of Marshall and Shipman, who point out that "The difficulty of knowledge acquisition, representation and reasoning has a long history of being underestimated", the aim of inventing hyperbooks is to build a knowledge organization structure that is as easy to construct as a weakly structured KOS (for instance glossaries or metadata-annotated models like learning objects), but has a stronger semantic structure that can be used for the integration process.

Many research communities have proposed writing full-fledged ontologies, which result in a KOS with a strong semantic structure. With such ontologies, logical reasoning may be possible, whereas it may be more difficult with a hyperbook structure that contains just a small domain ontology and textual fragments. On the other hand, ontologies built according to specifications like those proposed in the RDF/OWL family are time-consuming to construct and suitable only for homogeneous domains. For instance, it might be possible to create an OWL ontology describing all elements of a house, but it seems nearly impossible to write an ontology about the United Nations under the OWL specification.

Nevertheless, we found evidence from several examples that the hyperbook structure is suitable for integrating hyperbooks into a digital library of hyperbooks. Concepts must, however, be linked to representative fragments that define, describe, show examples of, or simply refer to the concept.

Last year, we tried to integrate two hyperbooks about agricultural policy made by domain specialists. A fully automatic integration approach was able to identify relations indicating equivalent and similar concepts.

Last winter, we had graduate students in a computer science course model hyperbooks about the topics of the course. We found a clear difference when comparing the students' hyperbooks with those built by domain specialists. The students found appropriate concepts, but did not take much care in selecting the fragments. This is probably because we provided them with slides from the course presentation and with selected publications on the course topics, so fragments were not easy to find and to write. Domain specialists can take advantage of documents from their daily work, so it may be easier for them to create well-made hyperbooks. We conclude that hyperbook creation is fastest when adequate material already exists in a knowledge base that can easily be fragmented. In particular, glossaries or similar KOS might be the best starting point for the construction of hyperbooks.

We propose the following integration process to assemble the digital library: First, we compute semantic similarities between concepts of the hyperbooks. The mapping approach relies on both conceptual structure comparison (based on word matching, semantic neighbourhood matching and the positions in the "is-a" and "part-of" hierarchies) and fragment comparison. The existence of semantic similarity between fragments increases the concepts' similarity. Secondly, the weighted similarity links are used to generate a reading interface of an extended hyperbook by presenting the book content within its semantic context. We built a prototype to generate virtual documents of formal hyperbooks and to apply filtering, organization and assembling mechanisms. To avoid the information overload that would result from attaching every kind of link to the initial hyperbook, we designed a graphical user interface generator that produces expand-in-place links for larger textual fragments, which are shown to users after they activate the corresponding link.
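The first step, combining structural comparison with fragment comparison into one weighted similarity score, might look roughly like the sketch below. It is an illustration under stated assumptions and not the authors' algorithm: Jaccard overlap stands in for each comparison, the hierarchy-position component is omitted, and the `Concept` class, the function names and the weights are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Concept:
    label: str
    neighbours: Set[str] = field(default_factory=set)    # labels of related concepts
    fragments: List[str] = field(default_factory=list)   # texts linked to the concept


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as no evidence (0.0)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def words(texts: Iterable[str]) -> Set[str]:
    """Lower-cased word set over a collection of texts."""
    return {w for text in texts for w in text.lower().split()}


def concept_similarity(c1: Concept, c2: Concept,
                       w_label: float = 0.4,
                       w_neigh: float = 0.3,
                       w_frag: float = 0.3) -> float:
    """Weighted sum of label, neighbourhood and fragment overlap (weights assumed)."""
    label_sim = jaccard(words([c1.label]), words([c2.label]))      # word matching
    neigh_sim = jaccard(c1.neighbours, c2.neighbours)              # semantic neighbourhood
    frag_sim = jaccard(words(c1.fragments), words(c2.fragments))   # fragment comparison
    return w_label * label_sim + w_neigh * neigh_sim + w_frag * frag_sim


# Hypothetical concepts from two hyperbooks.
a = Concept("crop rotation", {"agriculture", "soil"},
            ["Rotating crops preserves soil fertility"])
b = Concept("crop yield", {"agriculture"}, ["Yield depends on soil"])
c = Concept("house roof", {"building"}, ["Tiles cover the roof"])

print(concept_similarity(a, b) > concept_similarity(a, c))
```

Because fragment overlap contributes its own term, two concepts with matching fragments score higher than two concepts with the same labels and neighbours but unrelated fragments, which mirrors the statement that fragment similarity increases concept similarity.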

In the example with graduate students, it was more difficult to find appropriate similarities in a fully automatic integration process than with the hyperbooks built by domain experts. In this case, we need an alternative way to validate the determined relations. In social navigation and social bookmarking, when a user follows a link or bookmarks a page, this action increases the score (or weight) of that link or page, so everyone can benefit from everyone else's experiences, discoveries, etc. We use a similar principle to establish more reliable semantic relations between ontologies when concepts are not well described. It proceeds as follows:

When a user reads a hyperbook, the system automatically proposes links to (or expansion in place of) fragments from other hyperbooks according to the alignment described above. Then, the user can do three things:

1 and 2 reinforce the value of the similarity of the concepts involved in the link inference (2 is stronger), 3 weakens the similarity value. The underlying hypothesis is that the user will immediately see what is completely irrelevant.
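The feedback rule can be sketched as a small update function. The three user actions are not spelled out here, so the sketch refers to them only by number; the step sizes, the clamping to the range 0 to 1, and the names `FEEDBACK_STEP` and `update_similarity` are assumptions made for illustration.

```python
# Assumed step sizes: actions 1 and 2 reinforce (2 more strongly), action 3 weakens.
FEEDBACK_STEP = {1: 0.05, 2: 0.10, 3: -0.10}


def update_similarity(similarity: float, action: int) -> float:
    """Return the concept similarity after one user action, kept within [0, 1]."""
    if action not in FEEDBACK_STEP:
        raise ValueError(f"unknown action: {action}")
    updated = similarity + FEEDBACK_STEP[action]
    return min(1.0, max(0.0, updated))


# Replaying a hypothetical sequence of user actions on one similarity link.
score = 0.5
for action in (1, 2, 3, 2):
    score = update_similarity(score, action)
print(round(score, 2))
```

An additive update is only one option; a multiplicative or decaying update would satisfy the same ordering of the three actions.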

Future work consists of adding functions to "create your own book". Such a tool would contain link proposals that the user can include or not, or introduce a very simple "language" to express inter-book queries to create dynamically derived fragments, e.g. "Find examples of this concept in other books" or "find more descriptions of this concept". Here again, the user should be able to accept or reject answers.

The NKOS community has had a longstanding aim to describe the different kinds of Knowledge Organisation Systems, as relevant to networked terminology services, and relate them to similar knowledge schemes in other information disciplines. An initial taxonomy of KOS was one of the first items posted on the NKOS website. However progress since then has been sporadic and there is a need to further advance this agenda. This proposal aims to take some further steps down this route. The presentation will review and delineate different types of KOS and discuss appropriate roles for KOS in the Semantic Web. It will attempt to build on previous NKOS work in this area, in particular by discussing use contexts for KOS at a level high enough to allow some rough comparison with other disciplines contributing to the Semantic Web. [Full PDF version of abstract]

Briefing Session

Terwilliger, J. F., Delcambre, L.M.L. and Logan, J.
User-Oriented Knowledge Organization Systems.

Overview

Data, as it resides in most databases or file systems, is incomprehensible to the average person. This problem arises largely because database schemas are not designed to be user-friendly; rather, they are intended to be efficient for storage and retrieval (Figure 1). Some databases serve as the back-end for sophisticated, domain-specific applications, such as electronic medical records or other clinical software (Figure 2). These software applications are obviously designed to be used by domain experts. We want to reuse the structure and content of the user interface for this kind of software application (used to capture data) to describe the data and allow domain experts who are not computer scientists to query the underlying data using this familiar structure for their data. In addition, we want to support users that would like to transform data in such databases into whatever form is necessary to perform a task, such as a statistical study. [Full PDF version of abstract]

Lykke Nielsen, M., Eslau, A.G.
Indexing challenges in work place information retrieval. [Full PDF version of abstract]

Classification schemes, thesauri, taxonomies etc are brought together in BS8723, "Structured vocabularies for information retrieval". Parts 1 and 2 are already published, but there is still time to have an input to the Parts addressing multilingual vocabularies, mapping, and formats and protocols for interoperability.

Ahmad, F., de la Chica, S., Sumner, T. and Martin, J.
Supporting Science Understanding through a Customized Learning Service for Concept Knowledge (CLICK) [Full PDF version of abstract]