2006-12-29

Classifying is hard, tagging is worse

For years, my work at Mondeca has been to help build classification schemes, ontologies and the like for a variety of customers. Most of the time this means formalizing the implicit ontologies they already have in their data. Nor do I have to make any decisions about actually populating the schemes; that task is left to human editors or automatic text mining engines. I sometimes take care of the automatic migration of legacy content, but following rules decided with the customer. I'm very happy with all that, because I'm not good at classification. I tend to see so many subjects in anything, and any interesting resource to classify seems so multi-dimensional, that choosing a category always brings me to the brink of indecision, and any decision I eventually make always seems arbitrary. Maybe it comes from an old traumatic experience as an Open Directory editor.

Sound familiar? I can already hear the folksonomy people crying: "Hey, of course, that's why tagging is so cool". As far as I am concerned, tagging is worse. It means more arbitrary decisions: not only do I have to choose a category, I can choose more than one, or none at all, and I have to come up with them myself. Way too many decisions ... That's why my browser bookmarks and email folders are a mess, why I have no del.icio.us account, why my Technorati profile is so low, etc ...

Beyond my own difficulties with decisions, there is something to be added to this now long discussion about ontologies vs tagging. What I've learnt in science is that a good theory is a falsifiable one. What you assert using an ontology, in whatever language or framework with declared formal semantics, is falsifiable. No formal semantics, no notion of true and false, hence no falsifiability. In other words, and to make it simple, an RDF assertion can be declared or inferred true or false against a given ontology, an OWL class can be proven unsatisfiable, etc. Nothing of the kind with tags. The assignment of a tag cannot be proven true or false, or inconsistent. Tags are not falsifiable.
By the way, the same distinction is to be made for RDF vs Topic Maps. Topic Maps are not falsifiable, because they have no formal semantics. Now the question is whether falsifiability, which has proven to be critical in science, is also critical in information technologies.
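
To make the contrast concrete, here is a minimal sketch in Python. It is not an OWL reasoner: it only mimics one kind of axiom (class disjointness), and the class names and resources are made up for the illustration. The point is simply that typed assertions can be checked against declared axioms and found inconsistent, while a set of tags offers nothing to check against.

```python
# Toy stand-in for an ontology: a list of disjointness axioms.
# "ex:Person" and "ex:Document" are hypothetical class names.
DISJOINT_CLASSES = [frozenset({"ex:Person", "ex:Document"})]


def is_consistent(types_by_resource):
    """Return False if any resource is typed with two disjoint classes."""
    for classes in types_by_resource.values():
        for pair in DISJOINT_CLASSES:
            if pair <= set(classes):
                return False
    return True


# Typed assertions can be refuted by the axioms.
assertions = {"ex:post42": {"ex:Document"}}
print(is_consistent(assertions))          # consistent so far

assertions["ex:post42"].add("ex:Person")  # contradicts the disjointness axiom
print(is_consistent(assertions))          # now inconsistent

# Tags are free strings with no axioms behind them: whatever is put here,
# there is no rule that could declare it true, false or inconsistent.
tags = {"ex:post42": {"person", "document", "whatever"}}
```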

That said, since the new Blogger version enables easy tagging (maybe the older version did too, but I was never aware of it), and since there is now quite a bunch of posts on univers immedia, I decided to be brave and start tagging them, as thoughtlessly as possible. Starting with the most recent ones, I then shifted to the oldest, a good occasion to revisit them if nothing else. You can see the result on the left under "What". My first impression is of course that there are too many tags, but I will try to keep it up throughout the blog just to see how it flies, then maybe keep only the most frequent ones if I end up with too long a list.

2006-12-27

OWL ontology for identity on the web

This paper by Valentina Presutti and Aldo Gangemi at SWAP2006 begins with a clear introduction to the issue of resource identity, and the ambiguity of the term resource itself. It then goes on with a very smart OWL model attempting to articulate the various aspects of this concept. Maybe too smart and conceptual to become really popular but, interestingly enough, it goes against the popular Semantic Web assumption that a URI can actually identify "non-addressable things", and is rather in line with leaving the referent entities outside of the identification framework.

The definitions of resource that can be found in literature show ambiguity, making the issue of handling the identification of a web resource very problematic.

Our approach restricts the nature of the web resource to that of a computational object. This choice is motivated by the fact that a resource is something that has to be addressable, and things like cars and people are not addressable for their nature. Hence, it is wrong in principle to use the same mechanism of addressing for entities that have such different sorts.

Migration to new Blogger version

I took the opportunity of this migration to change the layout template. The new one seems more readable, and anyway I was fed up with all those dots. I like this little piece of architecture in the top left corner. Actually I think it's a lighthouse, since the template is called "Harbor", but it also looks to me like an observatory platform opening onto the sky, "useful by its opening and central emptiness" ... well in the spirit of univers immedia.

The list of contributors no longer shows; they have to do something about their Blogger account to be able to post again.

A couple of things I've been about lately

I've been silent here for over two months now, my blogging time being devoted to Leçons de Choses, the Mondeca blog in French. But there are a couple of things I've been working on that are worth mentioning.

I've had exchanges with Michel Biezunski about his Data Projection Model, and found that its genericity and simplicity made it easy and straightforward to express the structure of Mondeca ITM, without the borderline hacking needed when using either OWL-RDF or XTM for the same task. Now the open questions: What will happen with that model? Who will see its benefits over languages already in this space, and over RDF in particular? Who will build tools supporting it?

Been wondering if a semiotic approach could shed some light on our thoughts on referents, and came up with an RDF semiotic triangle. The URI is the signifier, and the RDF description is the formalisation of the signified concept associated with the URI. The referent is outside the realm of language and signs, and should stay there. In this approach, attempting to achieve a representation of the referent, even using tricks such as blank resources or hubjects of any kind, is therefore a recursive trap and actually nonsense. So any declaration of sameness or identity of referents should be avoided. Only concepts bear identity, not their referents. From that point on, I came to the idea that linking different concepts/signs (URI + RDF description) which humans consider to have more or less similar referents will take the form of processing rules, more than declarative semantics.

Thanks to Jakob Voss for this post in a long thread on the public-esw-thes list, which really triggered a kind of illumination about this. As an example, trying to say that my SKOS concept a:Restaurant has the same referent as your OWL class b:Restaurant through any declarative RDF relation between those two resources should be avoided. But I can set up in my system a functional rule expressing that any document whose subject is an instance of your b:Restaurant class will be indexed against my a:Restaurant concept. The referent is represented nowhere, but it is acting at the core of this rule.
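
The restaurant rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the actual mechanism of any product: the `Document` structure, the `INDEXING_RULES` table and the `apply_indexing_rules` function are all hypothetical names. Note that nothing in it states that a:Restaurant and b:Restaurant are the same thing; the link exists only as a processing rule.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    uri: str
    # Classes of which the document's subject is an instance (e.g. "b:Restaurant").
    subject_types: set
    # Local concepts the document ends up indexed against.
    indexed_against: set = field(default_factory=set)


# Hypothetical rule table: "their" class -> "my" concept.
# It is a processing rule, not an assertion of identity between the two.
INDEXING_RULES = {"b:Restaurant": "a:Restaurant"}


def apply_indexing_rules(doc, rules=INDEXING_RULES):
    """Index the document against every local concept triggered by a rule."""
    for cls in doc.subject_types:
        concept = rules.get(cls)
        if concept is not None:
            doc.indexed_against.add(concept)
    return doc


review = apply_indexing_rules(Document("ex:review1", {"b:Restaurant"}))
print(review.indexed_against)
```

A document whose subject is typed with a class that no rule mentions is simply left unindexed, which is the expected behaviour for a rule-based approach: no rule, no link.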

Actually we have this very indexing rule mechanism working in some Mondeca applications, and I have submitted a paper about it to XTech 2007. More to come if the paper is selected.

Lately, I got interested again in triggering some process to make languages available not only as tags to use in XML, but as proper RDF resources. This is an old story dating back to the OASIS Published Subjects Technical Committees, and in particular PSI for languages. Track this topic on ESW Wiki, and see here for the ongoing thread and more explanations. There again, my proposal is to forget about absolute identification of a language by a URI. The concepts identified by URIs are the properties and property values that can be declared for a language, and applications are left to decide which properties are useful to them. There is no absolute rule saying that two descriptions refer to the same language.
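
A rough sketch of what this could look like from the application side, in Python. The property names (`iso639-1`, `iso639-3`, `label`, `script`) and the `descriptions_match` function are assumptions of mine for the sake of the example, not part of any published vocabulary. Each language description is just a set of property values, and each application picks the properties it wants to match on.

```python
# Two hypothetical descriptions of a language, as property/value pairs.
description_a = {"iso639-1": "fr", "iso639-3": "fra", "label": "French"}
description_b = {"iso639-1": "fr", "label": "français", "script": "Latn"}


def descriptions_match(d1, d2, properties):
    """Application-level rule: two descriptions match when both declare,
    and agree on, every property this application cares about."""
    return all(p in d1 and p in d2 and d1[p] == d2[p] for p in properties)


# An application relying on two-letter codes treats them as matching ...
print(descriptions_match(description_a, description_b, ["iso639-1"]))

# ... while one relying on labels, or on three-letter codes, does not.
print(descriptions_match(description_a, description_b, ["label"]))
print(descriptions_match(description_a, description_b, ["iso639-3"]))
```

Whether the two descriptions are "about the same language" is never stated anywhere; each application answers that question for itself through the properties it chooses.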