2012-09-18

Don't feed the semantic black holes

If I remember correctly, it was at Knowledge Technologies 2001 that Ann Wrightson explained to us, during the informal RDF-Topic Maps session, how to build a semantic virus for Topic Maps through abuse of subject indicators. At the time OWL and its now infamous owl:sameAs were not yet around, but the idea was identical: if several "topics" A, B, C, ... indicate the same "subject" X, then they should be merged into a single topic. In linked data land, ten years later, it's the same story: if RDF descriptions A, B, C, ... declare an owl:sameAs link to X, then A, B, C, ... are all merged together with the current description of X.

Hence the very simple semantic virus concept:
1. Harvest all the topic identifiers you can grab from distributed topic maps (read today: URIs from distributed linked data).
2. Publish a new topic map adding a common subject indicator to every topic description you have harvested (read today: add owl:sameAs X to all resource descriptions).
Now if you query the resulting database for the description of any topic (resource) in it, you get every element of description of everything about anything. The whole map has collapsed into a single heavy and meaningless node. An irreversible semantic collapse.
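
To make the collapse concrete, here is a minimal sketch in Python of what a naive consumer does when it treats every sameAs link as an instruction to merge. Everything in it is an assumption made for illustration: the function name merge_by_same_as, the ex: identifiers, and the toy descriptions are hypothetical, and real topic map engines or triple stores implement merging in their own ways.

```python
from collections import defaultdict


def merge_by_same_as(descriptions, same_as_links):
    """Naively merge descriptions whose identifiers are linked by sameAs.

    descriptions: dict mapping an identifier to a set of statements.
    same_as_links: iterable of (identifier, identifier) pairs.
    Returns a dict mapping one representative identifier per merged
    group to the union of all statements in that group.
    """
    parent = {uri: uri for uri in descriptions}

    def find(uri):
        # Identifiers that only appear in links (such as X) are added lazily.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as_links:
        parent[find(a)] = find(b)

    merged = defaultdict(set)
    for uri, statements in descriptions.items():
        merged[find(uri)] |= statements
    return dict(merged)


# Step 1: harvest identifiers and descriptions (hypothetical toy data).
descriptions = {
    "ex:Paris": {"capital of France"},
    "ex:Texas": {"a state of the USA"},
    "ex:Mars": {"fourth planet from the Sun"},
}

# Without the virus, the three descriptions stay separate.
healthy = merge_by_same_as(descriptions, [])

# Step 2: publish a sameAs link from every harvested identifier to X.
virus_links = [(uri, "ex:X") for uri in descriptions]
infected = merge_by_same_as(descriptions, virus_links)

print(len(healthy), "nodes before,", len(infected), "node after")
```

In this sketch the merge is a plain union-find over identifiers, so a single published set of links is enough to put Paris, Texas and Mars in the same node, and nothing in the merged result records which statement came from which source.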

Feeding a data black hole
Short of such extreme cases, things like this are already happening in the linked data universe. Loose mappings between vaguely similar resources, mapping errors, and deliberately malicious data have all contributed to a slow but steady build-up of genuine semantic black holes, from which nothing meaningful can be extracted any more. Geographical entities currently seem to be the most obvious victims of such semantic collapse. Try this one from sameas.org, compare it with the original GeoNames description, and try to figure out where things went wrong ...

Hopefully, in the near future, provenance indication using named graphs or a similar mechanism will protect the planets and galaxies of linked data from falling into those traps. Meanwhile, when flying by such weird objects, don't bring your data too close to the horizon.