The case for Data Patterns

The W3C RDF Data Shapes Working Group is having a hard time naming the technology it is chartered to deliver. A proposal by +Holger Knublauch for Linked Data Object Model has triggered a lively discussion even outside the W3C group forum, on the Dublin Core list, where +Thomas Baker has supported and pushed further my suggestion to use data pattern instead of shape, model or schema in various combinations with linked and object. Since this terminological proposal made its way to the official proposal list over the weekend, maybe it's time to justify and explain this choice a bit more, and to say what I put technically under this notion of pattern.
I must admit I've not gone thoroughly through the Shapes WG long threads wondering, among other tricky questions, about resources and resource shapes, or if shapes are classes, and maybe the view I present below is naive, but the overall impression I get is that all those efforts to ground the work on RDFS or OWL are just bringing about more confusion on the meaning of already overloaded terms. A parallel discussion started a few days ago on the Semantic Web list from a falsely naive question by +Juan Sequeda on how to explore a SPARQL Endpoint. In this exchange with +Pavel Klinov, I take the position that exploring RDF data is looking for patterns, not for schema.
The terminological distinction is important. The notion of schema, or for that matter the alternative proposal model, is heavily overloaded in the minds of people with a database background, and it is on the other hand totally abused in the RDF world. Its use in the RDFS name itself was a big cause of confusion. Not to mention the more recent http://schema.org, which defines anything but a schema, even in its RDF expression. RDFS vocabularies and OWL ontologies are neither schemas nor models as understood in the closed world of databases or XML, namely global structures which precede and control the creation and/or validation of data. Using the term schema in the RDF landscape in fact prevents people from grokking that RDF data by design has no need for a schema. The absence of schema in an RDF dataset is not a bug, it's a feature. And the current raging debates only show that people put so many different meanings on schema when trying to use it about RDF data that you had better forget using it at all.
Patterns, on the other hand, can be present in data whether or not they have been defined a priori in a global schema or model. They can be observed over a whole dataset or only in parts of the data. They can be used for query, validation, and even making inferences. But they are agnostic about the various interpretations implied by such usages: they don't abide a priori by any closed or open world assumption.
Technically speaking, how can a data pattern be expressed? To anyone a bit familiar with SPARQL, it is formally equivalent to the content of a WHERE clause in a SPARQL query. Such content, by the way, is indeed called a graph pattern by the SPARQL specification itself. Let me take a simple example which will hopefully address en passant an issue expressed by +Karen Coyle, the fact that people (in the Shapes WG) have a hard time thinking about data without types (classes).

Let P1 be the following pattern (prefixes defined as per Linked Open Vocabularies).
?x   dcterms:creator  ?y.
?y   person:placeOfBirth ?z.
?z   dbpedia-owl:country  dbpedia:France
This pattern does not contain any rdf:type declaration, hence it does not seem like a shape under any of the current definitions proposed by the Shapes WG. It is not attached to, even less defined as, an explicit class. It does not rely on any RDFS or OWL construct.
What is the possible use of such a pattern? A basic level of use would be to declare that it is present or even frequent in the dataset (the description of the use of a pattern in a dataset could provide a COUNT giving the number of its occurrences), which means that if you use it as a WHERE clause in a SPARQL query over the dataset, the result will not be empty and will represent a significant part of the data.
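As a minimal sketch of that first level of use, the occurrences of P1 could be counted with a query like the one below. The prefix URIs are my assumption of what Linked Open Vocabularies lists for these prefixes, and the variable name ?occurrences is only illustrative.

```sparql
# Assumed namespace URIs for the prefixes used in P1
PREFIX dcterms:     <http://purl.org/dc/terms/>
PREFIX person:      <http://www.w3.org/ns/person#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia:     <http://dbpedia.org/resource/>

# Count the solutions of pattern P1 over the dataset
SELECT (COUNT(*) AS ?occurrences)
WHERE {
  ?x dcterms:creator      ?y .
  ?y person:placeOfBirth  ?z .
  ?z dbpedia-owl:country  dbpedia:France .
}
```

A count of zero means the pattern is absent from the dataset. Comparing the count to the total number of dcterms:creator triples would give a rough idea of how frequent the pattern is.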
Another level would be to associate P1 by some logical connector to another pattern, for example let P2 be the following one.
?x    dcterms:title  ?title.
 FILTER (lang(?title) = "fr")
One can now constrain the dataset by the rule P1 => P2 (supposing here the variable ?x is defined globally over P1 and P2). Said in natural language, if the creator of some thing is born in France, then this thing has a title in French (which might be a silly assumption in general, but can make sense in my dataset about French works). Note again that there is no assumption on the type or class of ?x and ?y. Of course one can fetch the predicates in their respective ontologies using their URIs and look up their rdfs:domain to infer some types. But you don't need to do that to make sense of the above constraint. Practically, this constraint would be validated on all or part of the dataset by the following query yielding an empty result.
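# Select the violations: things matching P1 for which P2 does not hold
SELECT DISTINCT ?x WHERE {
  ?x     dcterms:creator      ?p .
  ?p     person:placeOfBirth  ?place .
  ?place dbpedia-owl:country  dbpedia:France .
  FILTER NOT EXISTS {
    ?x dcterms:title ?title .
    FILTER (lang(?title) = "fr")
  }
}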
Not sure how P1 => P2 would be interpreted as an open world subsumption. Supposing you can interpret each of the patterns as some constructed OWL class for the common variable ?x, and write a subsumption axiom between those, not sure such an interpretation would be unique. Deriving types from patterns is something natural language and knowledge does all the time, but not sure if OWL for example is handling that kind of induction. There is certainly work on this subject I don't know of, but it's clearly not "basic" OWL.
In conclusion, I am not claiming that patterns and SPARQL cover all the needs and requirements of the Data Shapes charter, but I hope this shows at least that searching and validating data based on patterns can be achieved independently of RDFS or OWL constructs, and even of any rdf:type declaration.
Follow-up of the conversation on DC-Architecture list.

[EDITED 2015-01-27] After feedback from Holger and further reading of LDOM, it seems that the above P1 => P2 can be expressed as an LDOM Global Constraint encapsulating the SPARQL query, thus:
[] a ldom:GlobalConstraint ;
ldom:message "Things created by someone born in France must have a title in French" ;
ldom:level ldom:Warning ;
ldom:sparql """
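# Select the violations: things matching P1 for which P2 does not hold
SELECT DISTINCT ?x WHERE {
  ?x     dcterms:creator      ?p .
  ?p     person:placeOfBirth  ?place .
  ?place dbpedia-owl:country  dbpedia:France .
  FILTER NOT EXISTS {
    ?x dcterms:title ?title .
    FILTER (lang(?title) = "fr")
  }
}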
""" .


Untranslatables of philosophical engineering

It's been five years now since I first wrote here about translation and untranslatables. This paradigm has been on my mind ever since. Thanks to Santa Claus, the Dictionary of Untranslatables - English edition ten years after the original publication in French - has found its way to my bookshelves. This is a huge collaborative reference work which I cannot recommend enough to lovers of languages and philosophy. That hopefully includes just about anyone reading these lines.
By design, this work is a never-ending story, the untranslatable being, as Barbara Cassin keeps reminding us, what needs to be translated again and again, and we'll never be done with it. That's why adaptation of the Dictionary in more languages is planned or in the making.
It's interesting to explore the Dictionary to see if and how it helps us to understand the terms of our philosophical engineering realm, as Alexandre Monnin and Harry Halpin among others now like to call it. Those folks have already done a great job trying to shed a bit of philosophical light over our passionate permathreads about identity, reference, resource, document, work, concept and more of the same. If we are never done with those, it's not necessarily because philosophical engineers are either lousy philosophers or crappy engineers, or even both. Granted, all sorts of people have been involved in such debates, often with great innocence and not enough philosophical or technical background to tackle such difficult issues. But many more are very good philosophers or smart engineers indeed, and a good bunch of them can proudly claim to be both. If those smart brains can't come up, after years of debate, with definitive and clear agreements about such concepts, and how they should be translated in the Web languages and implemented in its very architecture, it's certainly because they are akin to the hard core of untranslatables that philosophers and linguists have kept trying to translate for ages. And with no surprise, some of them are already important entries in the said Dictionary, because they have a long history in pre-Web philosophy, like identity, reference, thing, work, representation, or word. Some more have minor or side entries, like topic or description, and some are absent because their conceptual difficulty has emerged as a pure product of the Web, like resource, content, or data.
Translation issues for such hard-core concepts do not only happen when translating them into other natural languages than their original one, most of the time some dialect of Globish. They also happen any time one wants to cross-link or make interoperable different business, technical or domain dialects, even using the same (apparently) natural language. Classes of librarians, knowledge engineers bred in Description Logics and object-oriented developers are different animals, their actual operational semantics are radically different, but close enough to bring about potential confusion when those people try to sit and build something together, an event likely to happen in the context of the Web. The local usages all appear as avatars of some (the same, or not) fuzzy underlying untranslatable, having to do with hierarchy and inheritance, rooted in the pervasive genus-species paradigm, and bearing the same name here and there by chance and necessity of linguistic evolution, involving various contingent reasons such as finitude of lexicons, laziness or reluctance to coin new and specific terms, jokes and puns, lousy semantic extensions etc. 
If the Web is here to stay as a major production of human knowledge, and considered as such by philosophers, a future revision or extension of the Dictionary will (should) include those untranslatables of philosophical engineering. For the greatest benefit of both philosophers and engineers, and the delight of all those pretending to be both.


Vocabularies are finite, hence ambiguous.

Vocabularies have been for thousands of years our main weapons in the fierce war against ambiguity. The Web has enabled the continuation of this war with new weapons called URI and RDF. This new battlefield has seen an unprecedented proliferation of terms, entities and concepts. Although everyone in this space goes on recommending the reuse of existing concepts and terms, not to reinvent the wheel and so on, we all strive for accuracy, and since the existing terms are never exactly fitting either our data or view of the world, we feel forced to add to the pile. We reinvent the wheel because our stuff is just so slightly different.
There is no possible end to this process. To achieve perfect accuracy and get rid of all ambiguity, we would need infinite vocabularies. We all know from high school that actual infinity is impossible to achieve, and this is quite simple to understand. But unbounded growth in a very large world, in other words potential infinity, is in practice as difficult to grasp as actual infinity. Both are, to paraphrase Woody Allen, very large near the end, and whatever the ability of the information system to scale (brainware, hardware and software all together), it will break at some point. If you say to someone that the universe is infinite, he's ready to accept it intellectually as a default option, because a universe with limits is in fact more difficult to grasp, not so much because of its weird space-time geometry as because its actual size and proportions, finite but so large, discourage any attempt to achieve an accurate physical or mental representation.
What do we bring home from that? That the finite nature of our vocabularies, even extended by the impressive growth of technologies, means that we have to live with ambiguity forever. Hence we have to consider ambiguity not as a bug, but as a feature of our vocabularies. Unfortunately many people still do believe, or act as if they believe, that because they are the domain experts and have worked on it for years, their terms are perfectly accurate and free from ambiguity. Expressing the semantics of terms in formal languages just comforts some of them in this dangerous illusion. And thinking we can achieve non-ambiguity prevents research from focusing on the real issue of how to deal practically with ambiguity, with the agility and efficiency of natural language conversation.

[Edited 2014-07-22] For a quite entertaining introduction to the issue, see "How many things are there?" 


Dimensions of online identity (about:me)

The ongoing discussion on how social accounts should be represented in schema.org is quite interesting to follow. I've not yet put my pinch of salt directly in this soup, just posted a side note on Google+, which triggered a small forking debate. I'm confident enough that people at schema.org will come out with some pragmatic decision, hence, as +Dan Brickley likes to say these days, I don't worry too much.
But underneath the technical issues, some good questions arise about online identity. Some people in that discussion seem to consider that their social accounts are not really identifying them. In a recent post, I defended the opposite view that URIs of social profiles are maybe the most representative of the online identity, and should be used as primary URIs at least in contexts where social interaction is at stake. I would like to go a little further in this analysis.
The following diagram is inspired by the work of Fanny Georges, with whom I had a few exchanges back in 2009 about those subjects of online identity. For those who read French, you can find some of the original concepts in this paper (see diagram on page 3), where she introduces the notions of declarative identity, active identity, and computed identity. I come out with a slightly different representation, but I wanted to acknowledge the source of most concepts in the picture.

This diagram defines two dimensions of the online identity : a personal-social axis, and a declarative-active one. Each corner of the diagram represents a combination of two poles of those axes. You can figure yourself easily where any resource linked somehow to you will fit, but better clarify by some examples :
Bottom left you find your good old '96 web page : been there, done that, my home, my kids, my research papers, my collection of old bikes, whatever. All chosen and made by you. Today you will find there a static online CV, for example.
Upper left contains anything said online about you, if you are (un)happy enough to be a public person : articles about you, photographs and videos, library records of your publications, a Wikipedia article about you if you are notable enough (unless you have had a hand in writing it, in which case it will be somewhere on the middle left).
Bottom right contains the traces of your individual interactions on the Web : the pages you visit, the searches you perform, the transactions you make etc. This part of your identity is split on many servers. A piece at your bank, a piece on Amazon, a piece at Google etc. This is the most obscure and frightening part of your identity, because you have no real control on that. Many systems know many things about you, that you might have forgotten.
Upper right contains all the interactions you have on the social Web : FB wall, comments on your blog posts, retweets, GMail etc.

Orthogonal to those two axes is whatever is computed from those data. Many things have been computed behind the scenes long ago from your personal activity (bottom right) : cookies on your browser, suggestions from Amazon, and all sorts of adware or malware entering your computer. Things computed from the social-active upper right are suggestions (friends, books ..) and anything Google or Facebook or whoever "thinks" you would like to do, read, buy etc.
The Semantic Web on the other hand, has been interested mainly in computing on declarative identity : DBpedia descriptions (upper left), FOAF profiles (bottom left). 
Google, for the Knowledge Graph, seems to gather stuff from all over the place : what I say and what I do, what others say about me and what others do in interaction with me. And at the end of the day, even if it's scary to see all this stuff put together and crunched by mysterious algorithms on Big G servers, all together it might yield a more balanced view of my identity than any of its aspects. That's why I take my Google+ URI to be as close as possible to the "about:me" node in the center of the diagram.

[Added 2014-04-14] See also Cybernetics and Conversation. Quote from the reference article (1996) 
Thus we find ourselves being constructed (defined, identified, distinguished) _by_ that conversation. From this point-of-view, our selves emerge as a consequence of conversation. Expressed more fully, conversation and identity arise together.


Query + Entity = Quentity

Neologisms are cool, particularly those of the portmanteau kind. Taking two old words and binding them together into a new hybrid semantic species is indeed as exciting, tricky and risky as tree grafting. And it takes some years, either for words or for trees, to figure out success or failure. Will you eventually harvest any fruit, will the hybrid survive at all? Nine years ago I introduced hubjects, and four years before that it was the semantopic map. Neither of those has grown in the expected direction or yielded the expected fruits, although they are both alive and well. Those poor results will not prevent me from trying a new grafting experiment with quentities, and let's meet here after 2020 under this new tree, and enjoy the fruits, if any.
So, what is this new semantic graft all about? I ranted here last year about Google not exposing public linked data URIs for its Knowledge Graph entities, and defining linked entities just as yet other queries. A similar criticism applies to the Bing version of the Knowledge Graph I just invited yesterday to play in this blog. But thinking twice about it, I wonder now if queries are not the right way, and maybe the best way, to consider entities in the Web of data. After all, many (most) URIs in the linked data cloud actually resolve to a query behind the scenes, even if they look like plain vanilla URIs. URIs at DBpedia, Freebase, VIAF, WorldCat, OBO, Geonames (just to name a few) are dereferenced through some query on some data base, which might or might not be a SPARQL query on a triple store.

Let's take this query which you can pass to the DBpedia SPARQL endpoint.
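PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT DISTINCT ?quake ?mag WHERE {
  ?quake dbpprop:magnitude ?mag .
  FILTER (contains(str(?quake), "earthquake"))
  FILTER (contains(str(?quake), "/20"))
  FILTER (?mag > 7)
  FILTER (?mag < 10)
}
# sort by decreasing magnitude
ORDER BY DESC(?mag)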
I've tweaked the filters in order to cope with the quite messy state of earthquake data in DBpedia : no single class or category for earthquakes, no consolidation of datatype in the values of magnitude (hence the max value filter), dates absent or in weird formats, but fortunately a quite standard URI syntax (every 21st century earthquake has a URI starting with http://dbpedia.org/resource/20 and containing "earthquake"). Lacking explicit semantic filters, use syntactic ones ... if you know the implicit semantics of the syntax, of course.
Granted, this query is as ugly as the data themselves are, but the result is not that bad and one could proudly call this "List of major earthquakes in the 21st century, sorted by decreasing magnitude".

Now I've encapsulated the query on DBpedia endpoint into a tiny URI. Does http://bit.ly/1lNkb0R qualify as a URI for an entity in the common meaning of "named entity"? One can argue forever to know if that "List of major earthquakes in the 21st century" is or is not an entity, but in my opinion it is one, no more no less than every individual earthquake in that list (the ontological status of an individual earthquake is a tricky one, too, if you think about it). 
One can argue also that this entity is a shaky one, because the result of this query is bound to change. The current list in DBpedia might be inaccurate or incomplete, some instances might escape the filter for all sort of obvious reasons, and obviously new major earthquakes are bound to happen in this century. Moreover, a stable meaning for this URI depends on the availability and stability of the bit.ly redirection service, on the availability of the DBpedia SPARQL endpoint, on the stability of Wikipedia URI policy and DBpedia ontology. Given all those particularities, let's assume we have a new kind of entity, defined by a query, that I propose to call for this very reason a quentity (shortcut for query entity), and an associated URI which I would gladly call a qURI (shortcut for query URI).

This qURI of course makes sense only as long as the technical context in which you can perform it is available. But is it different for any other kind of URI? To figure out what a URI means in the Web of data, you have to use an HTTP GET, which is nothing more than a query over the global data base which is the Web, and what you GET generally depends on the availability and stability of as many pieces of hardware, software and data as in the above example.
Indeed any URI can be seen, no more no less than the above bit.ly one, as an encapsulated query, making sense only when it's launched across the Web. And is not the elusive entity represented (identified) by this URI better seen as the query itself rather than as the query result? The query is permanent, whereas the query result is dependent on the everchanging client-server conversation context. 

So, if you want some kind of permanence in what the URI defines or identifies or represents (pick your choice), look at the query itself, not at the result. If you abide by this strange but inescapable conclusion, every entity in the Web is a quentity, and its URI is a qURI.

Follow-up discussion on Google+

Added 2014-04-09 : The G+ discussion introduced another and certainly better example to make my point : http://bit.ly/R2e3VV, a SPARQL CONSTRUCT yielding the same list in n3, making clear that the RDF description one GETs from this URI does not, and cannot, include any triple describing the URI itself.
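The query behind that short URL is not reproduced here. As a hedged sketch only, a CONSTRUCT over the same pattern and filters as the SELECT above could look like this (the choice of triples in the template is my assumption):

```sparql
PREFIX dbpprop: <http://dbpedia.org/property/>

# Build an RDF graph of major 21st century earthquakes and their magnitudes
CONSTRUCT {
  ?quake dbpprop:magnitude ?mag .
}
WHERE {
  ?quake dbpprop:magnitude ?mag .
  FILTER (contains(str(?quake), "earthquake"))
  FILTER (contains(str(?quake), "/20"))
  FILTER (?mag > 7)
  FILTER (?mag < 10)
}
```

Every triple produced has an earthquake URI as its subject. The query has no way of knowing the short URL it will be wrapped in, so nothing in the result can describe that URL.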


Bing Knowledge Graph

I've added the Bing Knowledge Graph widget to this blog, hoping that the introduction of Microsoft code is not a violation of the terms and conditions of Blogger. I guess Google will provide something similar pretty soon, based on its own Knowledge Graph. I wish there was some equivalent, non-proprietary widget leveraging entities in the Linked Open Data cloud.
The pages in this blog do not include many entities, though, and some results will certainly look funny. But I could not let this new step towards the Web of entities pass without having a try at it. Browsing around the pages, I noticed that the notion of "entities" is quite liberal, since for example "Philosophy" and "Semantic Web" are recognized, which is good news. Just about anything can be an entity, as long as it can be identified.

[2014-07-21 : end of the experiment]


Linked Open Vocabularies, please meet Google+

The Google+ Linked Open Vocabularies community was created more than one year ago. The G+ community feature was new and trendy at the time, and the LOV community quickly gathered over one hundred members, then the hype moved to something else, and the community went more or less dormant. Which is too bad, because Google+ communities could be very effective tools, if really used by their members, and LOV creators, publishers and users definitely need a dedicated community tool. We lately made another step towards better interfacing this Google+ community and the LOV data base. Whenever available, we now use in the data base the G+ URIs to identify the vocabulary creators and contributors. As of today, we have managed to identify a little more than 60% of LOV creators and contributors this way.
Among those, only a small minority (about 20%) are members of the above said community, which means about 80% of this community's members are either lurkers or users of vocabularies. It also means that a good deal of people identified by a G+ profile in LOV still rarely or never use it. One could think that we should then look at other community tools. But there are at least two good reasons to stick to this choice.
Google+ aims at being a reliable identity provider. This was clearly expressed by Google at the very beginning of the service. The recent launch of "custom URIs" such as http://google.com/+BernardVatant through which a G+ account owner can claim her "real name" in the google.com namespace is just a confirmation of this intention. "Vanity URLs" as some call them, are not only good at showing off or being cool. My guess is that they have some function in the big picture of the Google indexing scheme, and certainly something to do with the consolidation of the Knowledge Graph.
We need dynamic, social URIs. I already made this point at the end of the previous post. And the more so for URIs of living and active people. Using URIs of social networks will hopefully make obsolete the too long debate over "URI of Document" vs "URI of Entity". Such URIs are ambiguous, indeed, because we are ambiguous. 
The only strong argument against G+ URIs is that using URIs held in a private company's namespace to identify people in an open knowledge project is a bad choice. Indeed, but alternatives might turn out to be worse.