2015-01-28

Data Patterns, continued

This is a follow-up to the previous post, still trying to make sense of this pack of untranslatables: pattern vs schema vs structure vs model, and in particular how to draw the fine line between their descriptive and prescriptive aspects ... without spamming the DC-Architecture list any further with this discussion with +Holger Knublauch, which has somehow gone astray ...
Looking up pattern in Wiktionary yields a lot of definitions, among them the following two, broad enough to fit our purpose.
  • A naturally-occurring or random arrangement of shapes, colours etc. which have a regular or decorative effect. 
  • A particular sequence of events, facts etc. which can be understood, used to predict the future, or seen to have a mathematical, geometric, statistical etc. relationship. 
Further on in the same source, I discover that pattern can also be used as a verb (to pattern):
  • To make or design (anything) by, from, or after, something that serves as a pattern; to copy; to model; to imitate.

To discover, recognize, classify and name patterns in the world is a basic activity of our brain, and the very basis of our knowledge. Are those patterns emerging in our brains and projected onto reality? Or does the world really signify something to us (in the sense of the French faire signe) with those patterns, pointing to some internal logic and maybe meaning? I will remain agnostic here on this deep question, and rather look at an example which will bring us back to the question of patterns in data.
What do we see in this image? Objects of various shapes, sizes and colors, connected by edges which are apparently undirected. Some would call it a graph. Can you see any pattern? A casual look might miss it, and conclude that those shapes, colors and sizes are rather random and that their distribution is not really regular, although there are some vertical and horizontal alignments, groups of objects of the same color, and other groups of the same shape. A mix of order and randomness, like in the real world. Looking more closely, you will notice that connected objects share either a common color, or a common shape, or both (like the two red rectangles). This I will call a pattern.
We can now try to describe those objects in RDF data, using three predicates ex:shape, ex:color and ex:connected, and check if the pattern is general.

:blueMoon1  
    ex:shape  "moon";
    ex:color "blue";
    ex:connected  :blueTriangle1 .

:blueTriangle1  
    ex:shape  "triangle";
    ex:color "blue";
    ex:connected  :blueMoon1, :blueEllipse1, :redTriangle1 .

etc.

The pattern can be checked over the above data using this query

SELECT DISTINCT ?x
WHERE 
{
  ?x  ex:shape ?xShape.
  ?x  ex:color ?xColor.
  ?y  ex:shape ?yShape.
  ?y  ex:color ?yColor.
  ?x  ex:connected  ?y.
  FILTER (?xShape = ?yShape || ?xColor = ?yColor)
}

If the pattern holds everywhere, this query should yield every connected object in the graph. If there is a handful of exceptions out of thousands of objects, I will certainly consider this a general pattern, with some exceptions that I will look at closely for further investigation. If this pattern is observed for, say, 60% of nodes, I will certainly consider it a frequent pattern. If the result is less than 10%, I will tend to consider it a random structure rather than a pattern. All this activity is descriptive, with possible predictive purposes. I might have queried only a part of this graph because it has billions of objects, and assumed that the pattern extends to the rest.
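To make the descriptive reading concrete, here is a minimal sketch in plain Python rather than SPARQL. It is only an illustration under stated assumptions: the shapes and colors of blueEllipse1 and redTriangle1 are guessed from their names, their connections back to blueTriangle1 assume undirected edges, and the helper names (shares_feature, connected_pairs, coverage) are invented for the example. It also measures the pattern over connected pairs rather than over nodes, which is one possible way of counting among others.

```python
# Hypothetical sample. The first two objects follow the RDF data above; the
# last two are only named there, so their shape, color and connections are
# assumptions made for this sketch.
objects = {
    "blueMoon1": {"shape": "moon", "color": "blue",
                  "connected": {"blueTriangle1"}},
    "blueTriangle1": {"shape": "triangle", "color": "blue",
                      "connected": {"blueMoon1", "blueEllipse1", "redTriangle1"}},
    "blueEllipse1": {"shape": "ellipse", "color": "blue",
                     "connected": {"blueTriangle1"}},
    "redTriangle1": {"shape": "triangle", "color": "red",
                     "connected": {"blueTriangle1"}},
}


def shares_feature(a, b):
    """The pattern itself: two objects share a shape, a color, or both."""
    return a["shape"] == b["shape"] or a["color"] == b["color"]


def connected_pairs(objects):
    """Return each undirected connection once, as a sorted tuple of names."""
    pairs = set()
    for name, obj in objects.items():
        for other in obj["connected"]:
            pairs.add(tuple(sorted((name, other))))
    return sorted(pairs)


def coverage(objects):
    """Return (fraction of connected pairs following the pattern, exceptions)."""
    pairs = connected_pairs(objects)
    exceptions = [(a, b) for a, b in pairs
                  if not shares_feature(objects[a], objects[b])]
    if not pairs:
        return 0.0, exceptions
    return 1 - len(exceptions) / len(pairs), exceptions


ratio, exceptions = coverage(objects)
print(f"{ratio:.0%} of connected pairs follow the pattern; exceptions: {exceptions}")
```

Whether a given ratio counts as a general pattern, a frequent pattern or mere noise remains a judgment call. The sketch only provides the number and the list of exceptions to look at.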

Can I turn this pattern into a prescriptive rule? Sure enough. If I want to create a new object connected to the yellow triangle at the bottom right, it has to be either a triangle (free color), or a yellow whatever (free shape), or both. But ... may I introduce new colors and new shapes, such as a yellow star or a purple triangle? In an open world, this is not forbidden by my pattern. But my closed system can be more restrictive, and limit the shapes and colors to those already known. 
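The difference between the two readings can be sketched in a few lines of Python. This is an illustration only: the function may_connect and the SHAPES and COLORS vocabularies are hypothetical, and the vocabularies list only the values mentioned in this post, not necessarily everything in the picture.

```python
def may_connect(new, target, known_shapes=None, known_colors=None):
    """Check whether `new` may be connected to `target` under the pattern.

    With no vocabularies given, the rule is read in an open world: any shape
    or color is acceptable, as long as one of them is shared with the target.
    Passing known_shapes or known_colors closes the world to those values.
    """
    if known_shapes is not None and new["shape"] not in known_shapes:
        return False
    if known_colors is not None and new["color"] not in known_colors:
        return False
    return new["shape"] == target["shape"] or new["color"] == target["color"]


# Vocabularies assumed for illustration, limited to values named in the post.
SHAPES = {"moon", "triangle", "ellipse", "rectangle"}
COLORS = {"blue", "red", "yellow"}

yellow_triangle = {"shape": "triangle", "color": "yellow"}
yellow_star = {"shape": "star", "color": "yellow"}
purple_triangle = {"shape": "triangle", "color": "purple"}

# Open world: new shapes and colors are allowed if the pattern is respected.
open_star = may_connect(yellow_star, yellow_triangle)
open_purple = may_connect(purple_triangle, yellow_triangle)

# Closed world: the same candidates are rejected because "star" and "purple"
# are not part of the known vocabularies.
closed_star = may_connect(yellow_star, yellow_triangle, SHAPES, COLORS)
closed_purple = may_connect(purple_triangle, yellow_triangle, SHAPES, COLORS)

print(open_star, open_purple, closed_star, closed_purple)
```

The pattern itself is the last line of may_connect and is the same in both cases. Only the optional vocabulary checks turn it into a closed world schema.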

I'm pretty sure that people asked to extend this graph, even after discovering the underlying pattern, will wonder for a while whether or not they are allowed to introduce a yellow star or a purple triangle, because neither star nor purple appears in the current picture. It's likely that the most conformist of us will interpret the open pattern as a closed world schema, where objects can have only the shapes and colors already present. Not to mention the size, which has not been discussed, and is not represented in the data. Imaginative people, certainly many children, will take the open world assumption to freely invent new shapes with new colors, maybe joyfully breaking the pattern in many places. Logicians will be stuck wondering which logic to use, and are likely to do nothing but argue at length with each other about why.

What lessons do we bring home from this example?
  • Patterns can be discovered in data, or checked over data. 
  • The same observed pattern can be turned into an open world rule or included in a closed world schema, and there is generally no single way to do either of those.
  • We should have a way to represent and expose patterns in data, independently of their further use. The current RDF pile of standards has nothing explicitly designed for such representations, but SPARQL would be a good basis.
  • Patterns are not necessarily linked to types or classes of objects. In our example, no rdf:type is either declared in the data or used in the SPARQL query.
For those who read French, see also the post Le toro bravo et le top model on Mondeca's blog Leçons de Choses, dated April 2010, showing that those ruminations are not really new.