The South-East Asia Earthquake and Tsunami

The catastrophic news pushes me back to posting on this page, which had been silent for a while. A number of Web communities have quickly dedicated resources to informing people and bringing as much support as possible. Wikipedia very rapidly dedicated a page to the event, which is updated regularly. WorldChanging has also focused on the event and community resources, and the dedicated SEA-EAT blog provides news and information about resources, aid, donations and volunteer efforts.


RDF/Topic Maps Interoperability Task Force

In today's column, Edd Dumbill wonders whether the Semantic Web dreamers might turn out to be right after all:
The topics of XML-oriented programming languages and the Semantic Web have been targets of mockery in their time, so this week I'm asking whether the true believers might be right.
And further on:
The unthinkable rapprochement between topic maps and RDF has occurred, signified by the formation of the W3C RDF/Topic Maps Interoperability Task Force. The task force is part of the Semantic Web Best Practices and Deployment Working Group. The last time I was with the majority of the people listed as members of the task force, it was in a very pleasant drinking establishment in Amsterdam. It's nice to think that the bonhomie of that evening has persisted into forming the basis of the task force.


They Ain’t Nothin’ ‘til I Call ‘em!

In today's article in EContentMag, Bob Doyle writes:
There is nothing wrong with creating new words for branding and marketing purposes, but it makes a company look foolish if they don't know the basic technical vocabulary of their own industry. The problem is that technical jargon is notoriously slippery and it is often difficult to determine whether a true meaning for a word even exists.


Semantic Association Identification and Knowledge Discovery...

Picked this one up at PlanetRDF and here.
Our goal is to research new techniques and improving effectiveness of techniques to identify semantic associations and knowledge discovery by exploiting a large knowledge base. Specific objectives include (a) ontology driven lazy semantic metadata extraction (i.e., annotation) to complement traditional active metadata extraction techniques, and (c) formal modeling and high-performance computation of semantic association discovery including ontology-based contextual processing and relevancy ranking of interesting relationships.

Contextually relevant links include InfoQuilt, and SCORE.


Tom Gruber on ontologies

Danny Ayers cited the interview (linked under the title above) and posted this quote, which I have extended, and which I think is appropriate to any discussion about object identity. I am not sure what to make of the comment about state, but it sounds like he is making a case for a RESTful architecture. I leave that open for comments.
In fact, the World Wide Web is based on a semiformal ontology, and it shows how ontological commitment works in software interoperability. At its core, the concept of the hyperlink is based on an ontological commitment to object identity. In order to hyperlink to an object requires that there be a stable notion of object and that its identity doesn’t depend on context (which page I am on now, or time, or who I am). Most of the machinery of the early Web standards are specifications of what can be an object with identity, and how to identify it independently of context. These standards documents serve as ontologies - specifications of the concepts you need to commit to if you want to play fairly on the Web. If one built a system with these committments, all of the Web infrastructure works well. If you violate the spirit of the ontology - such as the agreement on identity - things don't work so well. For example, early Web servers often packed a lot of state into the URLs, which violated the notion of object identity. Systems built this way could not be searched, bookmarked, or mentioned in email messages. I think that there were design weaknesses in the ontologies - ambiguities in the standards documents - that allowed formal compatibility with the Web without a committment to the conceptualization on which it is based.


Web Proper Names

The SWAD forum is definitely the place to monitor these days. After the introduction of Subject Indicators in SKOS-Core, Alistair Miles launched a very lively thread, "Working around identity crisis", which yesterday drew Harry Halpin of the University of Edinburgh into the debate with an amazing paper called "Web Proper Names: Naming Referents on the Web".
The paper proposes a process that leverages the statistical results returned by search engines to define and name, bottom-up, equivalence classes of URIs which all together are 'probably' about the same thing. The concept is somewhat similar to the notion of a Subject Identity Measure, since the probability of sameness can be quantified.
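The mechanics can be sketched roughly as follows. This is my own toy illustration of the idea, not the paper's algorithm: the "result sets", the Jaccard similarity, and the 0.5 threshold are all invented for the sake of the example.

```python
def sameness(pages_a, pages_b):
    """Estimate the probability that two URIs are about the same subject
    from the overlap of the result sets a search engine returns for them
    (Jaccard similarity, one possible quantification)."""
    a, b = set(pages_a), set(pages_b)
    return len(a & b) / len(a | b)

# Hypothetical result sets for three URIs that all name "Paris"
results = {
    "http://example.org/paris-france": {"p1", "p2", "p3", "p4"},
    "http://example.net/ville-de-paris": {"p2", "p3", "p4", "p5"},
    "http://example.com/paris-hilton": {"p8", "p9"},
}

# Group URIs whose pairwise sameness exceeds a chosen threshold
THRESHOLD = 0.5
classes = []  # list of sets of URIs, each set one equivalence class
for uri, pages in results.items():
    for cls in classes:
        if all(sameness(pages, results[u]) >= THRESHOLD for u in cls):
            cls.add(uri)
            break
    else:
        classes.append({uri})

print(classes)  # the two "Paris" pages cluster; the third stays apart
```

The quantified threshold is where the "probability of sameness" comparison with a Subject Identity Measure comes in.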



Eugene Eric Kim has spoken at length about the Identity Commons and I-Names. The link under the title points to his post to the "yak" mailing list. I-Names form an important part of Augmented Social Networks, and, at the same time, they provide for subject identity for people who use them.
Briefly, i-names are like DNS for people. They're based on open standards, and the core infrastructure will be open source. They are designed to support services that will allow individuals to control their digital identities, a vision largely inspired by the recent whitepaper, "The Augmented Social Network: Building Identity and Trust into the Next-Generation Internet."

Finding Scams

"The increasing volume of financial scams operating via the Internet makes it difficult for regulators to identify and prosecute those responsible. ScamSeek is a document classification system that trawls Internet pages and classifies documents as scam, scam-like or non-scam. In its first trials hunting the public Internet it correctly identified and classified eighty percent of documents, leading to specific investigations by ASIC and referrals of some documents to other agencies for investigation."


Finding scientific topics

Other keywords: webmining, knowledge extraction

From the PNAS Mapping Knowledge Domains, we find the link under the title of this post. The topic has to do with various means, including probabilistic, by which scientific topics can be mined from a body of literature. I think this idea applies to those notions whereby subject identity is based on various properties, some of which are detected by datamining techniques. Requisite quote:
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
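The generative process the abstract describes can be sketched in a few lines. This is a toy illustration with made-up topics and vocabulary; the paper's actual contribution, the Markov chain Monte Carlo inference that inverts this process, is not shown.

```python
import random

random.seed(0)

# Two hypothetical topics, each a distribution over words
topics = {
    "genetics": {"gene": 0.5, "dna": 0.3, "protein": 0.2},
    "neuroscience": {"neuron": 0.6, "synapse": 0.25, "brain": 0.15},
}

def generate_document(topic_mixture, length):
    """Generate a document the way the model assumes: for each word,
    pick a topic from the document's distribution over topics, then
    pick a word from that topic's distribution over words."""
    words = []
    for _ in range(length):
        topic = random.choices(list(topic_mixture),
                               weights=list(topic_mixture.values()))[0]
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return words

doc = generate_document({"genetics": 0.7, "neuroscience": 0.3}, 10)
print(doc)  # ten words, drawn mostly from the "genetics" topic
```

Inference runs the other way: given only the documents, recover the topics and each document's mixture.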


Concise Bounded Descriptions

Thanks to a post by Danny Ayers, I got a chance to look at Nokia's latest contribution to open source software, Uriqa. Here's the requisite quote. I hope to have more to say about this later.

This document defines a concise bounded description of a resource in terms of an RDF graph [5], as a general and broadly optimal unit of specific knowledge about that resource to be utilized by, and/or interchanged between, semantic web agents.

Given a particular node in a particular RDF graph, a concise bounded description is a subgraph consisting of those statements which together constitute a focused body of knowledge about the resource denoted by that particular node.
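As a minimal sketch of that extraction, with a toy list of triples standing in for an RDF graph (the full definition also covers reification, which I omit here):

```python
def concise_bounded_description(graph, node, _seen=None):
    """Collect all statements whose subject is `node`, recursing into
    blank nodes that appear as objects, so the description stays
    self-contained. Reification handling is omitted in this sketch."""
    if _seen is None:
        _seen = set()
    _seen.add(node)
    cbd = []
    for s, p, o in graph:
        if s == node:
            cbd.append((s, p, o))
            # Toy convention: blank nodes are strings starting with "_:"
            if isinstance(o, str) and o.startswith("_:") and o not in _seen:
                cbd.extend(concise_bounded_description(graph, o, _seen))
    return cbd

graph = [
    ("ex:alice", "foaf:knows", "_:b1"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("_:b1", "foaf:name", '"Bob"'),
    ("ex:carol", "foaf:name", '"Carol"'),
]

print(concise_bounded_description(graph, "ex:alice"))
```

The description of ex:alice pulls in the blank node's statements but leaves ex:carol out: a focused body of knowledge about one resource.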


New Working Draft for the Topic Maps Reference Model

Steve Newcomb has announced the release of a new working draft of the TMRM. "Significantly shorter", this new version has set aside the convoluted details of previous versions about the structure of assertions, to focus on the management of subject identity, inside and across Topic Map Applications, which should in particular "disclose":
  • the rules for determining when multiple proxies are surrogates for the same subject
  • the rules for merging the values of the properties of proxies, when it has been determined that the proxies are surrogates for the same subject and they need to be viewable as a single proxy.
I notice the use of the word "rules" here, although later in the document more stress is put on "Subject Identity Properties". My guess is that the ongoing debate on the identification process could lead the TMRM, in the near future, to shift from those "SIPs" to "SIRs": "Subject Identification Rules".
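In miniature, the two disclosed rules quoted above might look like this. This is my own toy formalization, not the TMRM's: the property names and the "shared PSI" identity rule are invented for illustration.

```python
def same_subject(p1, p2, identity_props):
    """First disclosed rule: two proxies are surrogates for the same
    subject if their value sets intersect on any of the properties the
    disclosure designates as subject-identifying."""
    return any(p1.get(prop, set()) & p2.get(prop, set())
               for prop in identity_props)

def merge(p1, p2):
    """Second disclosed rule: the single resulting proxy carries the
    union of the two proxies' values for each property."""
    return {prop: p1.get(prop, set()) | p2.get(prop, set())
            for prop in set(p1) | set(p2)}

a = {"psi": {"http://psi.example.org/puccini"}, "name": {"Giacomo Puccini"}}
b = {"psi": {"http://psi.example.org/puccini"}, "born": {"1858"}}

if same_subject(a, b, identity_props=["psi"]):
    print(merge(a, b))  # one proxy, with name, born, and the shared PSI
```

Different Topic Map Applications would disclose different `identity_props` and merge behaviour, which is exactly the point of making the rules explicit.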

[Update 2013-02-05]: Even though I haven't been talking much about topic maps on this blog for quite a while, this post is the second most viewed since 2008. The final Topic Maps Reference Model, published in 2007, is available at Topic Maps Lab.


Of Presidents and Ontologies

Beyond the interest of its content (an RDF description of the new(?) US President), this article is also a good introduction to Tag URIs, in the following terms:

Under no circumstance should a Semantic Web application attempt to derive meaning from a URI string alone. Tag URIs have exactly one purpose: to allow us to quickly, consistently create unique identifiers that aren't intended to be de-referenced and without needing to register a URI scheme with anyone.
Tag URIs go in the opposite direction to PSIs, since they can't be de-referenced on the Web. They are instead supposed to be a sort of "self-explaining" identifier.
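For illustration, minting a tag URI (as specified in RFC 4151) is a one-liner; the authority, date and specific values below are made up. The only requirement is that the authority (a domain name or email address) belonged to you on the stated date.

```python
def tag_uri(authority, date, specific):
    """Mint a tag URI per RFC 4151: tag:authority,date:specific.
    Unique by construction, never meant to be de-referenced."""
    return f"tag:{authority},{date}:{specific}"

print(tag_uri("example.org", "2004-11", "us-president"))
# tag:example.org,2004-11:us-president
```

No registry, no server behind it: the identifier carries its own minting context.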


Identity and Disambiguation in Wikipedia

Browsing around the "Identity" article in Wikipedia, I discovered an interesting page called "Identity (disambiguation)", linking to different flavors of identity in different contexts, and itself an instance of the more generic category of pages called "Disambiguation", which explains the pragmatic solution adopted by Wikipedians for this issue:
"Disambiguation in Wikipedia and Wikimedia is the process of resolving the conflict that occurs when articles about two or more different topics have the same natural title."


On "who I am"

While roaming about in the space of massively multiplayer online games, from the learning perspective, I stumbled on the cognitive ethnography work of Constance Steinkuehler. Here, she says something about identity that, I think, is relevant to notions of identity and its articulation in computer records:
Because the activities I engage in are crucial to my identity. Who I am determines, and is reflexively determined by, my participation in various communities (Gee, 1999; Greeno, 1997). As Packer and Goicoechea (2000) put it, “A community of practice transforms nature into culture; it posits circumscribed practices for its members, possible ways of being human, possible ways to grasp the world—apprehended first with the body, then with tools and symbols—through participation in social practices in relationship with other people. Knowing is this grasping that is at the same time a way of participating and relating.” (p. 234) In other words, changes in knowing become changes in being: Through participation in a given Discourse (Gee, 1999), I do more than just acquire and reorganize mental representations of the world; who I am, who I see myself to be, is transformed by it. To quote Jodie Foster in her 90’s slasher film, “It changes me. And that changes everything.”

Vocabulary Management for the Semantic Web

Tom Baker, coordinator of the Vocabulary Management Task Force in the W3C SWBPD WG, has released a first draft of the technical note which is the intended deliverable of the VM TF, briefly defined in the TF definition page as follows:
A relatively concise technical note summarizing principles of good practice, with pointers to examples, about the identification of terms and term sets with URIs, related policies and etiquette, and expectations regarding documentation.
The current draft interestingly brings together practices from various communities and languages: FOAF, Dublin Core, SKOS, WordNet ... and Published Subjects. The release also includes an exhaustive list of links on the subject. The WG and the TF will have their F2F meeting next week in Bristol. I've been given the interesting challenge of delivering something for the following section:

Section 3.3.
What does it mean to "use" Terms from one Vocabulary in another?
The problem of "semantic context". Terms may be embedded in clusters of relations from which they may be seen in part to derive their meaning. It may therefore not always be sensible to use those terms out of context. Examples include the terms of thesauri or ontologies, as well as XML elements, which may be defined with respect to parent elements and may therefore not always be reusable as properties in an RDF sense without violating their semantic intent.
In other words: does a term, clearly identified by its URI, but used in a context different from its original one, still identify the same original concept?


Metadata for the masses

Free tagging. Interesting. Let folks tag (identify) objects according to their own whims. Well, it's different. I'm not analyzing it, just pointing at it.


Identity as process

In an earlier post, Bernard talked about the thought that there is no identity, only identification process. Music to my ears.

In an earlier post of mine, I pointed to the TMRM, the new reference model for topic maps. Now I would like to point to a plethora of presentations at coolheads.com, the website of Steve Newcomb and Michel Biezunski, in particular the slides titled "What is a Topic Map Application (TMA)?". Roam about those slides and it may become apparent to some readers that subject identity is, indeed, a process, one in which the topic map author discloses the means by which topic identity is established and is to be compared where merging of topics is a goal.

My own interpretation, not putting words in anyone else's mouth here, is that the TMRM aims to make subject identity a center-stage process, where portions of a subject's identity are established through assertions. Follow me closely here. Assertions replace the familiar associations of XTM, performing the same function of establishing typed, scoped relationships between topics. They then, under direction of disclosures, play a role in contributing information to a subject's identity through castings. I prefer to think of the casting topics in light of my "Seeing As..." discussion earlier.

My present interpretation, then, is that subject identity is a dance. A process. I won't argue against the use of the familiar PSI (URI) notation for some objects in some universe of discourse. Indeed, we can all agree on the concepts "mother", "father", "sister", and so forth, and I'd be very happy to disclose that PSIs for concepts like those would suffice, no matter what language the name string for such topics turns out to be. But, in routine conversations, thought processes, and some written stuff, I, like many others, tend to orbit around subject identity, talking about "passengers in cars" and the like, contextually sensitive identifiers which may link back anaphorically to some other statement. It's often a composition of many statements, assertions, that leads to identity. It's a process.


Typing: the main context in the identification process?

Looking back at some recent posts, it seems that most of the time identification processes take place in contexts where the type (class, category ...) of the thing to identify has been explicitly or implicitly defined. It's explicitly said in the paper quoted in yesterday's post, and it appears in an ongoing thread on topicmapmail, where Lars Marius Garshol, speaking about a Subject Identity Measure, writes:
"Another consideration is that I think types are extremely important. If the names are the same but the types are disjoint (person and place, say) then you can safely ignore the names. You might even want to make the algorithm consider typing topics first, and only afterwards go after the instances."
This is perhaps also what lies behind Jack's post "Seeing As...". You can identify a "real" person only if you see them as a person; and on the other hand, given the appropriate identification context, you can consider any kind of data (a photograph, a phone number, an email address, handwriting, the sound of a voice, a perfume ...) as identifying a person, provided you have set an identification context where the type of object to identify is "Person".

Those considerations make me wonder about the credibility of URIs, PSIs and other kinds of "universal identifiers" if set outside any processing context; maybe the minimal processing context should be the type of the thing identified. If http://psi.oasis-open.org/iso/639/#fra is set as a PSI for an instance of the class "Language", would it make any sense to use this identifier to identify a topic without implicitly assuming that this topic is itself an instance of that same class?
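Garshol's suggestion, consider typing first, only then the instances, can be sketched as a tiny scoring function. This is my own toy scoring, not his actual Subject Identity Measure; the topics and the name-overlap metric are invented.

```python
def identity_score(topic_a, topic_b):
    """Type-aware subject identity measure: disjoint types veto the
    match outright; only then do shared names contribute evidence."""
    if topic_a["type"] != topic_b["type"]:
        return 0.0  # a person and a place can safely share a name
    shared = set(topic_a["names"]) & set(topic_b["names"])
    total = set(topic_a["names"]) | set(topic_b["names"])
    return len(shared) / len(total)

paris_city = {"type": "Place", "names": ["Paris"]}
paris_person = {"type": "Person", "names": ["Paris"]}
paris_city2 = {"type": "Place", "names": ["Paris", "Ville Lumière"]}

print(identity_score(paris_city, paris_person))  # 0.0: types disjoint
print(identity_score(paris_city, paris_city2))   # 0.5: same type, shared name
```

The same name "Paris" is ignored entirely across disjoint types, which is exactly the behaviour the quote argues for.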


The More Things Change, The More They Are the Same

Jack forwarded me last week an intriguing message from Chris Landauer, from some non-public forum, dealing with identity and undecidability.
"Another question came up about recognizing when two objects are the same. Since that question is formally undecidable even in polynomial expressions over the integers (Hilbert's 10th problem), I don't see it as being possible in the more complicated spaces of computing systems (...) without some kind of "cheating", to use the usual mathematical parlance.

In this case, one useful kind of cheating is to provide some very carefully engineered introspective processes that let the computing objects help to analyze and combine themselves with others. This notion of "Computational Reflection'' is one of the main principles that underlie Kirstie Bellman's and my theoretical computing research, which has shown that we can build such computing systems"
A quick search on Landauer and Bellman led me to a very rich list of publications.

Searching further for "identity + undecidable", I stumbled on this amazing paper written in the early days of the Web (October 1993) by Henry G. Baker. Although written in the context of distributed computing, and quite technical, dealing in depth with various flavours of identity/equality of objects in object-oriented languages such as LISP, many excerpts of the paper make sense even to the non-programmer:
"Our model for object identity is similar to the concept of "operational identity", in which objects which behave the same should be the same."
Of course, to behave supposes a context of behaviour ... which comforts me in the line of thought that there is no identity, only identification process.


SKOS, some first impressions

Thanks to Bernard, I started looking at SKOS. Here are just a few first comments, mostly comparisons to topic maps, but also concerning an interpretation of the structures discussed in the SKOS document. I don't talk about subject identity in this post, but may return to that issue later.

From the SKOS metamodel

SKOS-Core allows you to define concepts and concept schemes.

A concept is any unit of thought that can be defined or described. 

A Subject is anything you can talk about.

A Topic is a proxy for a subject: one subject, one topic.

A concept scheme is a collection of concepts.

Sounds like a Category, to me. In fact, later in the SKOS document, they mention the terms fundamental category and fundamental facet.

A concept may have any number of attached labels.
A label is any word, phrase or symbol that can be used to refer to the concept by people.

A concept may have only one preferred label, and any number of alternative labels. 

A Topic can have any number of names, each with or without a Scope.

A Topic's base name can be stated differently under different Scopes (typically different languages, giving multilingual topic names).
SKOS facilitates similar scoping.

Relationships may be defined between concepts within the same concept scheme. Any such relationship is referred to here as a semantic relation.

Associations may be defined between Topics.

Relationships, as stated in SKOS, sound like the morphisms of category theory. From the perspective of representing, say, a thesaurus, I can see the logic in having categories, as one might expect with word senses, where a root word forms a category capturing all of that word's derivatives. One can do that directly with the associations of topic maps, but there still may be some merit in the category notion.

Mappings may be defined between concepts from different concept schemes. Any such mapping is referred to here as a semantic mapping. 

Nothing similar exists in topic maps. Mappings, as found here, sound like the functors of category theory. The SKOS document doesn't appear to go much further with mappings.

I am beginning to suspect, without further research and based solely on a first impression of the opening paragraphs of the SKOS document, that there is a strong category-theoretic underpinning to SKOS. That would be a useful underpinning. I have a strong sense that there is a similar underpinning that can be interpreted in the TMRM.


Representation - Denotation in SKOS

A very active thread in the SKOS forum discusses how to express the relationships between a concept expressed in SKOS and various resources (individuals, classes or properties) based on the same concept in RDF and OWL.
The explanations of Dan Brickley provide a good summary of the issue.


Guidelines for assigning identifiers to metadata terms

This document provides some simple guidelines for assigning identifiers to non-DCMI metadata terms (elements, element refinements, encoding schemes and vocabulary terms).

"Although these guidelines are mainly intended for metadata application profiles that conform with the Dublin Core Abstract Model, it is hoped that they are generic enough that they may be useful in the context of other metadata applications as well. "


Which is the most semantic?

Catching up on the news after coming back from holidays, I stumbled on two interesting resources.

The first one is explicitly claiming to be in the Semantic Web framework:

The other one is just another Google avatar:

Go figure which is closer to the Semantic Web objectives ...


Automated Concept Identification within Legal Cases

A legal knowledge based system called JUSTICE is presented which can identify heterogeneous representations of concepts across all major Australian jurisdictions, and some concepts within US and UK cases. The knowledge representation scheme used for legal and common sense concepts is inspired by human processes for the identification of concepts and the expected order and location of concepts ...

The legal domain is certainly one to monitor, since lawyers have been pioneering languages and methods for the identification of complex and fuzzy concepts for thousands of years.


Identification as an experimental protocol

Answering Jack about the two forms of identity (absolute and in-context), my thesis here will be that there is no absolute identity of things, nor even maybe anything to identify, but only an identification process, upon which both humans and systems have to agree. So, instead of wondering about the nature of identity, maybe we should follow an approach similar to the one quantum mechanics introduced into physics: focus on identification as an experimental protocol, and forget about the "Uncertain Reality" of the subject [1].
The GAIA example posted last week shows what an identification process looks like in astronomy.

  • Collect data following a well-defined protocol.
  • Define which configuration of data constitutes a "punctual light source", in other words an object potentially identifiable as a star.
  • Define which sets of data (characterized by their types and value distributions) conform to a model of star emission (which is very tricky, since those characteristics vary across a very wide spectrum).
  • Compare those data with each other (each potential star will be observed many times during the mission).
  • If possible, compare the mission data to previous data and catalogs to match known objects with those defined by the mission (taking into account that a star is a living object, whose characteristics are a priori variable, even on small time scales).
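As a toy sketch of the last step of this protocol, catalog cross-matching by position, here is a minimal matcher. The catalog entries and the tolerance are invented for illustration; real astronomical cross-matching must handle proper motion, measurement error models and far denser fields.

```python
def match_to_catalog(observation, catalog, tolerance=0.001):
    """Match an observed source (RA, Dec in degrees) to known catalog
    objects whose position lies within a tolerance box, allowing for
    measurement error and the object's own variability."""
    ra, dec = observation
    return [name for name, (cra, cdec) in catalog.items()
            if abs(ra - cra) <= tolerance and abs(dec - cdec) <= tolerance]

# Hypothetical catalog with approximate positions
catalog = {
    "HIP 71683": (219.9021, -60.8340),   # roughly Alpha Centauri A
    "HIP 91262": (279.2347, 38.7837),    # roughly Vega
}

print(match_to_catalog((219.9025, -60.8338), catalog))  # ['HIP 71683']
print(match_to_catalog((0.0, 0.0), catalog))            # []: a new object?
```

An empty match is itself informative: either a newly identified object, or evidence that the identification context (tolerance, model) needs revisiting.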

The complexity of such a task, which should yield about one billion objects in the sky, is indeed quite similar to that of identifying the billions of resources on the Web which can be identified as representing a "punctual subject". Thinking that any subject could be represented simply by a single URI is as naive as thinking that every star in the galaxy has a simple, single, straightforward and stable "observational signature".

[1] "Une Incertaine Réalité" is the original title of a book by Bernard d'Espagnat about the status of reality in Modern Physics. 
Gauthiers-Villars, 1985. ISBN: 2040164049. http://www.worldcat.org/oclc/420187422
I am not sure it has been translated into English, but others of his books have been, e.g.:
"Veiled Reality: An Analysis of Present-Day Quantum Mechanical Concepts"
Publisher: Perseus Books - ISBN: 081334087X. http://www.worldcat.org/oclc/30110071


Seeing As...

I just googled "seeing as" and got 405,000 hits. Now, there's a physterity if there ever was one. Some of those posts are really like "seeing as to how ...", so they're off topic, sort of, but many of the others are close to the notions of subject identity that I'd like to toss out here. Specifically, I'd like to start the discussion around the notion that subject identity has two (maybe more!) senses, one of which is some absolute identity, and the other of which is context-sensitive.

Absolute identity, itself, is a tricky thing. If you start with my name, Jack Park, and google that, you will get some good hits (moi!) and some not-so-good hits, like someone with my name who authored a book on sporting events. So names alone won't cut it.

In-context identity, now, that's a whole 'nother bag-o-worms, with an entailment mesh as wide as the universe. So I'm going to go out on a limb (copyright-wise) and type in here some of the words of Gian-Carlo Rota [1], who was describing the words of Stanislaw Ulam under the banner "In Memoriam of Stan Ulam: The Barrier of Meaning". I do so to start a discussion about that marvelous mantra of working biologists: context is everything, which comes from the amazing ability of a pluripotent stem cell to morph into just about any of the zillions of kinds of cells in an organism according to the local context, as defined by the hormonal bath surrounding it. This relates directly, I think, to part of the thinking behind the latest Topic Maps Reference Model, in which identity, or the composition of identity in a topic, is more robust than naming or published subject indicators. The quote:

"Now look at that man passing by in a car. How do you tell that it is not just a man you are seeing, but a passenger?
When you write down precise definitions for these words, you discover that what you are describing is not an object, but a function, a role that is inextricably tied to some context. Take away the context, and the meaning also disappears.
[skipped text]
Do you then propose that we give up mathematical logic? said I, in fake amazement.
Quite the opposite. Logic formalizes only very few of the processes by which we actually think. The time has come to enrich formal logic by adding to it some other fundamental [n]otions. What is it that you see when you see? You see an object as a key, you see a man in a car as a passenger, you see some sheets of paper as a book. It is the word "as" that must be mathematically formalized, on a par with the connectives "and," "or," "implies," and "not" that have already been accepted into a formal logic."

[1] In Physica 22D (1986) 1-3, North-Holland, Amsterdam; also found in Doyne Farmer, Alan Lapedes, Norman Packard, Burton Wendroff, editors, Evolution, Games, and Learning: Models for Adaptation in Machines and Nature, North-Holland.



Toronto, Canada - August 24, 2004 - In a style more reminiscent of cave paintings or the scratchings of ancient Egyptians, scientists at the Blueprint Initiative (Blueprint) research program, led by Dr. Christopher Hogue at Mount Sinai Hospital's Samuel Lunenfeld Research Institute, have created a new visual language called OntoGlyphs to help scientists quickly identify the biological attributes of molecules in general and particularly the ones found in the Biomolecular Interaction Network Database (BIND).

"One of the biggest challenges that researchers face is trying to identify the individual molecular needles in the myriad haystacks of biological data. That's why we focused on developing a visual, 'glyphic' language, one that would allow researchers to identify patterns or connections at a glance."



Identification + Classification = GAIA

Unsurprisingly, the first Google hit for 'Identification + Classification' is about astronomy: http://www.mpia-hd.mpg.de/GAIA/
The GAIA project aims at identifying and classifying over 10^9 light sources, sorting them into stars, solar system objects, galaxies, quasars and the like. Astronomers have always led the way in classification and inventories. This is the next step ...


Reference by Description

Quite close to the previous post :

"How we refer to something is very important in exchanging information about that thing. If the two parties cannot agree on how to refer to a thing, they cannot exchange information about it."


[2015-02-09] Yet another dead link ...

Subject Identity

How things (abstract concepts or real-world stuff) are identified on the Web is a critical issue. I've posted some ideas about it in various places. See e.g.

A recent post by Stella Dextre Clarke in the SWAD Forum shows the complexity of the issue in the framework of Thesaurus management.

[Note 2013-02-05]: Amazed to find out that my Extreme Markup 2001 paper is not only still online in the OASIS archives of the Published Subjects Technical Committee, but shows up on the first page of results of a Google search on "subject identity".


SKOS is an open collaboration developing specifications and standards to support the use of knowledge organisation systems (KOS) on the Semantic Web. It is one of the most interesting ongoing projects in this area.

[Update 2013-02-05] SKOS has become one of the core vocabularies used in the linked data ecosystem.