2015-01-26

The case for Data Patterns

The W3C RDF Data Shapes Working Group is having a hard time naming the technology it is chartered to deliver. A proposal by +Holger Knublauch for Linked Data Object Model has triggered a lively discussion even outside the W3C group forum, on the Dublin Core list, where +Thomas Baker has supported and pushed further my suggestion to use data pattern instead of shape, model or schema in various combinations with linked and object. Since this terminological proposal made its way over the weekend to the official proposal list, maybe it's time to justify and explain this choice a bit more, and to say what I technically put under this notion of pattern.
I must admit I have not gone thoroughly through the long Shapes WG threads wondering, among other tricky questions, about resources and resource shapes, or whether shapes are classes. Maybe the view I expose below is naive, but the overall impression I get is that all those efforts to ground the work on RDFS or OWL are just bringing about more confusion on the meaning of already overloaded terms. A parallel discussion started a few days ago on the Semantic Web list from a falsely naive question by +Juan Sequeda on how to explore a SPARQL endpoint. In this exchange with +Pavel Klinov, I take the position that exploring RDF data is looking for patterns, not for a schema.
The terminological distinction is important. The notion of schema, or for that matter the alternative proposal model, is heavily overloaded in the minds of people with a database background, and it is on the other hand totally abused in the RDF world. Its use in the RDFS name itself was a big cause of confusion. Not to mention the more recent http://schema.org, which defines anything but a schema, even in its RDF expression. RDFS vocabularies and OWL ontologies are neither schemas nor models as understood in the closed world of databases or XML, namely global structures which precede and control the creation and/or validation of data. Using the term schema in the RDF landscape is in fact preventing people from grokking that RDF data by design has no need for a schema. No schema in an RDF dataset is not a bug, it's a feature. And the current raging debates only show that people put so many different meanings on schema when trying to use it about RDF data that you had better not use it at all.
Patterns, on the other hand, can be present in data whether or not they have been defined a priori in a global schema or model. They can be observed over a whole dataset or only in parts of the data. They can be used for querying, for validation, and even for making inferences. But they are agnostic about the various interpretations implied by such usages: they don't abide a priori by any closed or open world assumption.
Technically speaking, how can a data pattern be expressed? To anyone a bit familiar with SPARQL, it is formally equivalent to the content of a WHERE clause in a SPARQL query. Such content, by the way, is indeed called a graph pattern by the SPARQL specification itself. Let me take a simple example which will hopefully address en passant an issue expressed by +Karen Coyle: the fact that people (in the Shapes WG) have a hard time thinking about data without types (classes).

Let P1 be the following pattern (prefixes defined as per Linked Open Vocabularies).
{
?x   dcterms:creator  ?y.
?y   person:placeOfBirth ?z.
?z   dbpedia-owl:country  dbpedia:France
}
This pattern does not contain any rdf:type declaration, hence it does not seem like a shape under any of the current definitions proposed by the Shapes WG. It is not attached to, even less defined as, an explicit class. It does not rely on any RDFS or OWL construct.
What is the possible use of such a pattern? A basic level of use would be to declare that it is present, or even frequent, in the dataset (the description of the use of a pattern in a dataset could provide a COUNT to indicate the number of its occurrences). This means that if you use it as a WHERE clause in a SPARQL query over the dataset, the result will not be empty and will represent a significant part of the data.
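To make the idea of counting occurrences concrete, here is a minimal sketch in plain Python rather than SPARQL. It is not how a SPARQL engine works internally. The toy triples, the ex: identifiers and the helper names (is_var, match) are all made up for illustration. It treats P1 as a list of triple patterns and enumerates the variable bindings under which every one of them occurs in the data.

```python
# Hypothetical toy dataset: each triple is a (subject, predicate, object) tuple.
TRIPLES = {
    ("ex:book1", "dcterms:creator", "ex:hugo"),
    ("ex:hugo", "person:placeOfBirth", "ex:besancon"),
    ("ex:besancon", "dbpedia-owl:country", "dbpedia:France"),
    ("ex:book2", "dcterms:creator", "ex:dickens"),
    ("ex:dickens", "person:placeOfBirth", "ex:portsmouth"),
    ("ex:portsmouth", "dbpedia-owl:country", "dbpedia:United_Kingdom"),
}

# P1 as a list of triple patterns; terms starting with "?" are variables.
P1 = [
    ("?x", "dcterms:creator", "?y"),
    ("?y", "person:placeOfBirth", "?z"),
    ("?z", "dbpedia-owl:country", "dbpedia:France"),
]


def is_var(term):
    """A term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triples, binding=None):
    """Yield each variable binding under which all triple patterns occur in triples."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    head, rest = pattern[0], pattern[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(head, triple):
            if is_var(term):
                # Bind the variable, or check it agrees with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


solutions = list(match(P1, TRIPLES))
print(len(solutions))  # number of occurrences of P1 in the toy data
```

Note that nothing in the matcher looks at rdf:type: the pattern is found purely from the predicates that connect the resources.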
Another level would be to associate P1 by some logical connector to another pattern, for example let P2 be the following one.
{
?x    dcterms:title  ?title.
 FILTER (lang(?title) = "fr")
}
One can now constrain the dataset by the rule P1 => P2 (supposing here the variable ?x is defined globally over P1 and P2). Said in natural language: if the creator of some thing was born in France, then this thing has a title in French (which might be a silly assumption in general, but can make sense in my dataset about French works). Note again that there is no assumption on the type or class of ?x and ?y. Of course one can fetch the predicates in their respective ontologies using their URIs and look up their rdfs:domain to infer some types. But you don't need to do that to make sense of the above constraint. Practically, this constraint is validated on all or part of the dataset when the following query yields an empty result.
SELECT *
WHERE
{
  ?x     dcterms:creator     ?p .
  ?p     person:placeOfBirth ?place .
  ?place dbpedia-owl:country dbpedia:France .
  FILTER NOT EXISTS
  {
    ?x dcterms:title ?title .
    FILTER (lang(?title) = "fr")
  }
}
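The logic of this validation query can be sketched outside SPARQL as well. The following plain Python is only an illustration under stated assumptions: the toy triples, the ex: identifiers, the representation of language-tagged literals as (text, lang) tuples, and the helper names are all hypothetical. It finds every ?x matching P1 and keeps those for which no French title exists, which is the role FILTER NOT EXISTS plays in the query.

```python
# Hypothetical toy dataset; language-tagged literals are (text, lang) tuples.
TRIPLES = {
    ("ex:book1", "dcterms:creator", "ex:hugo"),
    ("ex:book1", "dcterms:title", ("Les Misérables", "fr")),
    ("ex:book3", "dcterms:creator", "ex:hugo"),
    ("ex:book3", "dcterms:title", ("The Hunchback of Notre-Dame", "en")),
    ("ex:hugo", "person:placeOfBirth", "ex:besancon"),
    ("ex:besancon", "dbpedia-owl:country", "dbpedia:France"),
}

# P1 as a list of triple patterns; terms starting with "?" are variables.
P1 = [
    ("?x", "dcterms:creator", "?p"),
    ("?p", "person:placeOfBirth", "?place"),
    ("?place", "dbpedia-owl:country", "dbpedia:France"),
]


def is_var(term):
    """A term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triples, binding=None):
    """Yield each variable binding under which all triple patterns occur in triples."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    head, rest = pattern[0], pattern[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(head, triple):
            if is_var(term):
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


def has_french_title(subject, triples):
    """P2 for a fixed ?x: some dcterms:title literal tagged 'fr' exists."""
    return any(
        s == subject and p == "dcterms:title" and isinstance(o, tuple) and o[1] == "fr"
        for s, p, o in triples
    )


# Violations of P1 => P2: resources that match P1 but have no French title.
violations = sorted(
    {b["?x"] for b in match(P1, TRIPLES) if not has_french_title(b["?x"], TRIPLES)}
)
print(violations)  # an empty list would mean the constraint holds
```

In this toy data the constraint does not hold, because one work by a creator born in France only has an English title. As in the query, the check never asks what class ?x belongs to.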
I am not sure how P1 => P2 would be interpreted as an open world subsumption. Supposing you can interpret each of the patterns as some constructed OWL class for the common variable ?x, and write a subsumption axiom between those, I am not sure such an interpretation would be unique. Deriving types from patterns is something natural language and knowledge do all the time, but I am not sure that OWL, for example, handles that kind of induction. There is certainly work on this subject I don't know of, but it's clearly not "basic" OWL.
In conclusion, I am not claiming that patterns and SPARQL cover all the needs and requirements of the Data Shapes charter, but I hope this shows at least that searching and validating data based on patterns can be achieved independently of RDFS or OWL constructs, and even of any rdf:type declaration.
Follow-up of the conversation on the DC-Architecture list.

[EDITED 2015-01-27] After feedback from Holger and further reading of LDOM, it seems that the above P1 => P2 can be expressed as an LDOM Global Constraint encapsulating the SPARQL query, thus:
ex:MyConstraint
  a ldom:GlobalConstraint ;
  ldom:message "Things created by someone born in France must have a title in French" ;
  ldom:level ldom:Warning ;
  ldom:sparql """
    SELECT *
    WHERE
    {
      ?x     dcterms:creator     ?p .
      ?p     person:placeOfBirth ?place .
      ?place dbpedia-owl:country dbpedia:France .
      FILTER NOT EXISTS
      {
        ?x dcterms:title ?title .
        FILTER (lang(?title) = "fr")
      }
    }
    """ .