2012-03-22

LOV stories, Part 2: Gardeners and Gatekeepers

In the previous post, we saw why Linked Open Vocabularies should be managed according to the principles and rules of the commons: shared resources in which each stakeholder has an equal interest. This means that, in theory at least, all stakeholders of the commons should be both users and gardeners. In the vocabulary commons, the stakeholders (and hence potential gardeners) are as varied as can be: they encompass all actual or potential providers, curators and users of linked data, since there is no quality linked data without quality linked vocabularies, as we explained in a recent post. But let's look at who the current gardeners of the vocabulary commons actually are.
The following (non-exhaustive) list shows a great diversity of vocabulary publishers, which is good news.
  • Standards bodies such as W3C or DCMI
  • Institutional heritage curators such as libraries (Library of Congress, European national libraries)
  • Global organisations federating the work of the above (IFLA, OCLC, Europeana ...)
  • Press and Media groups and associations (New York Times, BBC, IPTC ...)
  • Major Web companies, for example through the schema.org initiative
  • Governments and institutional data providers (data.gov, Ordnance Survey, INSEE ...)
  • Academic and research centers (DERI, INRIA ...)
  • Funded research projects (notably those funded by the European Community)
  • SMEs, consulting companies, software vendors (Talis, TopQuadrant, Mondeca ...)
  • Individual initiatives, more or less backed by one or more of the above (taxonconcept, geospecies, geonames, lingvoj, lexvo ...)
Who are the most important actors among all those? Who is shaping the landscape? Is there true synergy and collaboration between all or some of them, or rather isolated think tanks and plain ignorance of each other's work, or competition, or some kind of productive coopetition? Do some of those gardeners also play the role of gatekeepers, controlling de facto the access to the vocabulary commons? The complex reality is a bit of all that, and the many ways vocabularies evolve, interact and re-use each other often take counter-intuitive paths. Very smart ontologies built by acknowledged experts, backed by famous research centers and substantial funding, may turn out to be barely used and ignored beyond closed academic circles, to the (justified) frustration of their creators. Conversely, some half-baked, low-cost vocabularies developed by freelance pioneers, although they declare themselves unstable and work in progress, are widely re-used alongside, and at a level of trust apparently equivalent to, standards cast in stone after years of long discussions and a rigorous development track.
Why is this so? Vocabulary production and re-use is a chaotic, non-linear process, but a general rule seems to be that social aspects and community interaction are at least as important as intrinsic quality when assessing a vocabulary. A first re-use often starts a strong positive feedback loop, since a term which has been used is more visible and more likely to be re-used. Hence, small and simple vocabularies with a very low cost of production and maintenance can reward their creators with both strong visibility and unexpected responsibilities towards their actual and potential users. These are responsibilities of which they are not necessarily aware, or which they do not always take seriously, and this is a point we will come back to in the next post, on sustainable management.
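The feedback loop described above can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not a model fitted to LOV data: the function name `simulate_reuse`, the number of terms and the rule "pick a term with probability proportional to its past uses" are all hypothetical choices made for illustration.

```python
import random


def simulate_reuse(n_terms=10, n_choices=1000, seed=0):
    """Toy 'rich get richer' model of vocabulary term re-use.

    Assumption (not from LOV data): each new dataset picks one term with
    probability proportional to the number of times it was already used.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    # Every term starts with a count of 1, so unused terms can still be picked.
    uses = [1] * n_terms
    for _ in range(n_choices):
        chosen = rng.choices(range(n_terms), weights=uses, k=1)[0]
        uses[chosen] += 1  # being used once makes the term more visible
    return uses


if __name__ == "__main__":
    counts = simulate_reuse()
    # Print usage counts from most to least used term.
    print(sorted(counts, reverse=True))
```

All terms start out identical in this sketch, so whatever spread appears in the final counts comes only from the order of the early picks, not from any difference in intrinsic quality.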
In any case, aware of it or not, on purpose or by chance, individual actors and the way they interact with the ecosystem have an important impact on its overall evolution, often more important than initial funding or institutional support. Outstanding positions in the landscape are actually gained through an open-minded development process and the ability to engage in true community conversation, to listen beyond closed circles and think tanks, to provide feedback to users, and finally to make the vocabulary a fruit of such exchanges. The correlation between the density of links between vocabularies and the density of social links between their creators would indeed be worth a detailed investigation.

Let's focus on a couple of exemplary gardening stories. Of course FOAF comes to mind first, but it is now so pervasive that it could appear as a happy exception, having had the chance to be the first to define obviously important classes such as Person and Organization. We have spoken about it at length in a recent post, so the mention here is just to insist that, as in nature, being the first to occupy a niche is critical. So let's focus on some perhaps lesser-known stories.

The GeoSpecies and TaxonConcept vocabularies and knowledge bases are developed by Peter DeVries. Peter is a biologist dedicated to bringing the complexity of biological taxonomy into the linked data world. This is a real challenge, since species are described in a large variety of information systems, nomenclatures and formats, and they are organized following diverse, competing and evolving classifications. Many efforts, such as the Encyclopedia of Life, are heading towards the aggregation of all information about a given species, including its various classifications and names, status (living, extinct, endangered), academic publications, images etc. But data about a species also includes individual field observations, and information on which species you could expect to meet in a given place on Earth at a given time. The latter question is what GeoSpecies is all about.
Too many projects in biological ontology work in the closed think tanks of the discipline's experts, considering only representations strictly based on "scientific ontology", with little interest in integration with the wider world of linked data. Those projects are backed by heavyweight institutions, big funding, and gurus in knowledge representation. I won't give any example, to avoid starting a war, but they are easy to find. In contrast to this approach, Peter looks and listens outside his discipline, trying to re-use as much as possible what is already working and shared elsewhere, for example FOAF, SKOS, Geonames, the Event and Time ontologies, and even vocabularies used for science popularization in the media, such as the BBC Wildlife Ontology. This is exactly the kind of gardener you need for the commons: one who cares not only for his own trees and bushes, but also considers how they fit in the general landscape.
Typically, although his work is backed by his academic institution, Peter personally owns the domain names of his vocabularies. In a private discussion he explained it to me this way:
The domain name is portable and can move from institution to institution. In some ways it is much more sustainable than a number of other projects because it has very low overhead [...] There are other projects that rely on large grants or other external support that will completely disappear if their funding ends.
The wise gardener does not want any gatekeeper to prevent him from accessing the places he's been managing for years, so he's also the master of the gates.

The Music Ontology is another example of a vocabulary that is very well linked both technically and socially, and moreover published following best practices, well documented, and used in many important datasets. Its creators, Frédérick Giasson, Yves Raimond and Bob Ferris, are repeat offenders in vocabulary creation. They represent another type of gardener, both technically savvy and socially open to collaborative work, with a native, built-in conception of the Web as a place for sharing knowledge. Bob Ferris presents himself (among various other aliases) as a "Philosophical Engineer". The Music Ontology is surrounded by a number of related ontologies, in which the same group of people is often involved. There again, along with the support of their backing institutions, the dedicated personality of the creators and their engagement in the community make the success story. But this is not enough to ensure the success of every story. In a side exchange during the preparation of this post, Bob Ferris complained:
... the majority of the ontologies where I am a co-author are lacking a bit of the "community support" and utilisation in datasets (since they are not backed by a big funding project or other similar things to raise huge attention) [...] sometimes I have the feeling that people start re-using the wrong thing ...
One possible answer is that, if some vocabularies struggle to find their niche in the LOD ecosystem, there is as much data looking for good vocabularies as there are vocabularies looking for data. A good vision of the global ecosystem is necessary to make them meet. That's what LOV is about ...

A somewhat different, but equally instructive, story is told by some vocabularies whose namespace should inspire trust, since it also hosts the standards supporting the whole Semantic Web infrastructure stack: RDF, RDFS, OWL, SKOS. But along with those famous ones, other vocabularies in the W3C namespace have been created by individual people, some of whom are no longer members of the W3C team. Those vocabularies have no status on the W3C recommendation track and, curiously enough, are not even clearly endorsed by the W3C although they are published under its namespace. The status of such vocabularies seems to be a sort of standby: no one, even at W3C, knows what to do about them. Are they obsolete? Are they reliable? Are they here to stay? Their creators have no idea of what will happen to them. Hopefully, the institution will at some point look into the issue and do something about it.

In a similar situation are the many vocabularies developed as deliverables of funded projects, which are no longer curated once the project is finished, and which nobody can change because they are deliverables. When contacted about such vocabularies, their creators either ignore the messages, or reply politely, expressing their helplessness ...

Many more instructive examples can be found along the LOV alleys. One could think that all those stories are typical of a pioneering age which will soon be over, a time when few institutions had yet taken the measure of the importance of the issues linked to vocabulary management, and the community was small and dedicated. Since linked data is more mainstream today, it may be that more institutional support will be given to vocabulary development, and that future vocabularies will be managed in more anonymous but more efficient ways, with less important social networking effects. It is not certain that we will get there any time soon, nor that this is something to hope for.

In any case, all the above stories show that there is still work to do to devise sustainable ways of managing the vocabulary commons. We'll try to explore the paths towards such a future in the next post.