Tom Gruber on ontologies

Danny Ayers cited that interview (linked under the title above) and posted this quote, which I have extended, and which I think is appropriate to any discussion of object identity. I am not sure what to make of the comment about state, but it sounds like he's making a case for a RESTful architecture. I leave that up for comments.
In fact, the World Wide Web is based on a semiformal ontology, and it shows how ontological commitment works in software interoperability. At its core, the concept of the hyperlink is based on an ontological commitment to object identity. Hyperlinking to an object requires that there be a stable notion of object and that its identity doesn’t depend on context (which page I am on now, or time, or who I am). Most of the machinery of the early Web standards is specifications of what can be an object with identity, and how to identify it independently of context. These standards documents serve as ontologies - specifications of the concepts you need to commit to if you want to play fairly on the Web. If one built a system with these commitments, all of the Web infrastructure works well. If you violate the spirit of the ontology - such as the agreement on identity - things don't work so well. For example, early Web servers often packed a lot of state into the URLs, which violated the notion of object identity. Systems built this way could not be searched, bookmarked, or mentioned in email messages. I think that there were design weaknesses in the ontologies - ambiguities in the standards documents - that allowed formal compatibility with the Web without a commitment to the conceptualization on which it is based.


Web Proper Names

The SWAD Forum is definitely the place to monitor these days. After the introduction of Subject Indicators in SKOS Core, Alistair Miles launched a very lively thread, "Working around identity crisis", which drew in Harry Halpin of the University of Edinburgh, who yesterday introduced into the debate an amazing paper called "Web Proper Names: Naming Referents on the Web".
The paper proposes a process, leveraging the statistical results yielded by search engines, to define and name, bottom-up, equivalence classes of URIs which all together 'probably' are about the same thing. The concept is somewhat similar to the notion of a Subject Identity Measure, since the probability of sameness can be quantified.
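As a rough illustration of the idea (not the paper's actual algorithm), one could cluster URIs by comparing the term sets a search engine returns for each of them. The similarity measure, the threshold, and the data below are all invented for the sketch:

```python
# Hypothetical sketch only: group URIs into equivalence classes by the
# overlap of the term sets a search engine might return for each one.
# The similarity measure, threshold, and data are all invented here.

def jaccard(a, b):
    """Set similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def equivalence_classes(term_sets, threshold=0.5):
    """Greedily group URIs whose term sets overlap above the threshold."""
    classes = []  # list of (member URIs, union of their terms)
    for uri, terms in term_sets.items():
        for members, class_terms in classes:
            if jaccard(terms, class_terms) >= threshold:
                members.add(uri)
                class_terms |= terms
                break
        else:
            classes.append(({uri}, set(terms)))
    return [members for members, _ in classes]

# Illustrative "search result" term sets for three URIs
terms = {
    "http://a.example/paris": {"paris", "france", "city", "seine"},
    "http://b.example/paris": {"paris", "france", "capital", "city"},
    "http://c.example/paris-tx": {"paris", "texas", "usa", "town"},
}
classes = equivalence_classes(terms)
print(classes)
```

The greedy grouping is the simplest possible choice; the point is only that "probably the same referent" becomes a quantifiable threshold on a similarity score.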



Identity Commons and I-Names

Eugene Eric Kim has spoken at length about the Identity Commons and I-Names. The link under the title points to his post to the "yak" mailing list. I-Names form an important part of Augmented Social Networks, and, at the same time, they provide for subject identity for people who use them.
Briefly, i-names are like DNS for people. They're based on open standards, and the core infrastructure will be open source. They are designed to support services that will allow individuals to control their digital identities, a vision largely inspired by the recent whitepaper, "The Augmented Social Network: Building Identity and Trust into the Next-Generation Internet."

Finding Scams

"The increasing volume of financial scams operating via the Internet makes it difficult for regulators to identify and prosecute those responsible. ScamSeek is a document classification system that trawls Internet pages and classifies documents as scam, scam-like or non-scam. In its first trials hunting the public Internet it correctly identified and classified eighty percent of documents, leading to specific investigations by ASIC and referrals of some documents to other agencies for investigation."
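ScamSeek's actual classifiers are proprietary language-technology models; purely as a toy illustration of the three-way output it produces (scam / scam-like / non-scam), here is a naive keyword-scoring sketch with made-up terms and cutoffs:

```python
# Toy sketch of a three-way scam classifier. This is NOT how ScamSeek
# works; the suspicious terms and cutoffs are entirely invented.

SCAM_TERMS = {"guaranteed", "risk-free", "offshore", "untraceable", "returns"}

def classify(text, scam_cutoff=3, scamlike_cutoff=1):
    """Score a document by suspicious-term hits and bucket it."""
    hits = sum(1 for w in text.lower().split() if w.strip(".,!?") in SCAM_TERMS)
    if hits >= scam_cutoff:
        return "scam"
    if hits >= scamlike_cutoff:
        return "scam-like"
    return "non-scam"

print(classify("Guaranteed risk-free offshore returns!"))  # scam
print(classify("Quarterly shareholder report"))            # non-scam
```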


Finding scientific topics

Other keywords: webmining, knowledge extraction

From the PNAS "Mapping Knowledge Domains" collection, we find the link under the title of this post. The topic has to do with various means, including probabilistic ones, by which scientific topics can be mined from a body of literature. I think this idea applies to those notions whereby subject identity is based on various properties, some of which are detected by data-mining techniques. Requisite quote:
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
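The generative process described in the abstract can be sketched in a few lines. The topics and vocabulary below are tiny invented examples, and the Markov chain Monte Carlo inference the paper presents is not shown:

```python
# Minimal sketch of the generative model in the abstract: each document
# draws a distribution over topics, then each word is drawn from the
# topic selected for it. Topics and vocabulary are invented; the paper's
# MCMC inference algorithm is not implemented here.
import random

random.seed(0)

TOPICS = {  # illustrative topic -> word-probability tables
    "genetics": {"gene": 0.5, "dna": 0.3, "sequence": 0.2},
    "neuroscience": {"brain": 0.5, "neuron": 0.3, "cortex": 0.2},
}

def generate_document(n_words, topic_weights):
    """Pick a topic per word, then a word from that topic's distribution."""
    names = list(TOPICS)
    doc = []
    for _ in range(n_words):
        topic = random.choices(names, [topic_weights[t] for t in names])[0]
        words, probs = zip(*TOPICS[topic].items())
        doc.append(random.choices(words, probs)[0])
    return doc

# A document that is 80% "genetics", 20% "neuroscience"
doc = generate_document(10, {"genetics": 0.8, "neuroscience": 0.2})
print(doc)
```

Inference runs this process in reverse: given only the words, recover the per-document topic mixtures and per-topic word distributions that plausibly generated them.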


Concise Bounded Descriptions

Thanks to a post by Danny Ayers, I got a chance to look at Nokia's latest contribution to open source software, Uriqa. Here's the requisite quote. I hope to have more to say about this later.

This document defines a concise bounded description of a resource in terms of an RDF graph [5], as a general and broadly optimal unit of specific knowledge about that resource to be utilized by, and/or interchanged between, semantic web agents.

Given a particular node in a particular RDF graph, a concise bounded description is a subgraph consisting of those statements which together constitute a focused body of knowledge about the resource denoted by that particular node.
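That definition translates almost directly into code. A minimal sketch, assuming triples as plain tuples and a "_:" prefix marking blank nodes (a convention invented for this sketch, not a real RDF library's API):

```python
# Sketch of the definition above: starting from a node, take every
# statement with that node as subject, and recurse through objects that
# are blank nodes. Triples are plain (s, p, o) tuples; the "_:" prefix
# for blank nodes is an invented convention, not a real RDF API.

def is_blank(node):
    return isinstance(node, str) and node.startswith("_:")

def concise_bounded_description(graph, node, seen=None):
    """Collect the concise bounded description of `node` from a triple list."""
    seen = set() if seen is None else seen
    seen.add(node)
    cbd = []
    for s, p, o in graph:
        if s == node:
            cbd.append((s, p, o))
            if is_blank(o) and o not in seen:
                cbd.extend(concise_bounded_description(graph, o, seen))
    return cbd

graph = [
    ("http://ex.example/doc", "dc:creator", "_:b1"),
    ("_:b1", "foaf:name", "Example Author"),
    ("http://ex.example/other", "dc:title", "Unrelated"),
]
cbd = concise_bounded_description(graph, "http://ex.example/doc")
print(cbd)
```

The recursion through blank nodes is what makes the description "bounded": anonymous structure hanging off the resource is included, while statements about other named resources are not.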


New Working Draft for the Topic Maps Reference Model

Steve Newcomb has announced the release of a new working draft for the TMRM. "Significantly shorter", this new version has set aside the convoluted details of previous versions about the structure of assertions, to focus on the management of subject identity, inside and across Topic Map Applications, which should "disclose" in particular:
  • the rules for determining when multiple proxies are surrogates for the same subject
  • the rules for merging the values of the properties of proxies, when it has been determined that the proxies are surrogates for the same subject and they need to be viewable as a single proxy.
I notice the use of the word "rules" here, although later on in the document more stress is put on "Subject Identity Properties". My guess is that the ongoing debate on the identification process could lead the TMRM, in the near future, to shift from those "SIPs" to "SIRs": "Subject Identification Rules".
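A hypothetical sketch of how those two rules might fit together, assuming (purely for illustration) that a single "subject-identifier" property establishes when proxies are surrogates for the same subject; the TMRM leaves the actual rules to each Topic Map Application:

```python
# Hypothetical sketch of the two disclosed rules: proxies are surrogates
# for the same subject when their subject identity property values are
# equal, and surrogates' remaining property values are merged so they
# can be viewed as a single proxy. Property names are invented here.

SIP = "subject-identifier"  # the invented identity-establishing property

def merge_proxies(proxies):
    """Merge proxies whose subject identity property values are equal."""
    merged = {}
    for proxy in proxies:
        target = merged.setdefault(proxy[SIP], {SIP: proxy[SIP]})
        for prop, values in proxy.items():
            if prop != SIP:
                target.setdefault(prop, set()).update(values)
    return list(merged.values())

proxies = [
    {SIP: "http://psi.example/puccini", "name": {"Puccini"}},
    {SIP: "http://psi.example/puccini", "name": {"Giacomo Puccini"}, "born": {"1858"}},
    {SIP: "http://psi.example/verdi", "name": {"Verdi"}},
]
merged = merge_proxies(proxies)
print(merged)
```

Unioning the value sets is only one possible merging rule; a real Topic Map Application could equally well prefer one value, or flag conflicts.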

[Update 2013-02-05]: Even though I haven't been talking much about topic maps on this blog for quite a while, this post is the second most viewed since 2008. The final Topic Maps Reference Model, published in 2007, is available at Topic Maps Lab.


Of Presidents and Ontologies

Beyond its sheer content interest - providing an RDF description of the new(?) US President - this article is also a good introduction to Tag URIs, in the following terms:

Under no circumstance should a Semantic Web application attempt to derive meaning from a URI string alone. Tag URIs have exactly one purpose: to allow us to quickly, consistently create unique identifiers that aren't intended to be de-referenced and without needing to register a URI scheme with anyone.
Tag URIs go in the opposite direction from PSIs, since they can't be de-referenced on the Web. So they are supposed to be sort of "self-explaining" identifiers.
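Minting a tag URI (defined in RFC 4151) is just deterministic string formatting from an authority you control, a date on which you held it, and a locally unique specific part. The authority and specific part below are illustrative values, not taken from the article:

```python
# Minting a tag URI per RFC 4151: tag:<authority>,<date>:<specific>.
# The authority name and specific part below are illustrative only.
import datetime

def mint_tag_uri(authority, date, specific):
    """Build a tag URI from an authority, a date, and a specific part."""
    return f"tag:{authority},{date.isoformat()}:{specific}"

uri = mint_tag_uri("example.com", datetime.date(2004, 11, 3), "us-president")
print(uri)  # tag:example.com,2004-11-03:us-president
```

No registry, no server, no de-referencing: the uniqueness guarantee comes entirely from owning the authority name on the stated date.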


Identity and Disambiguation in Wikipedia

Browsing around the "Identity" article in Wikipedia, I discovered an interesting page called "Identity (disambiguation)", linking to different flavors of identity in different contexts, and itself an instance of the more generic category of pages called "Disambiguation", which explains the pragmatic solution adopted by Wikipedians on this issue:
"Disambiguation in Wikipedia and Wikimedia is the process of resolving the conflict that occurs when articles about two or more different topics have the same natural title."