2015-04-14

Weaving beyond the Web

More on this story of names (including URIs) and text (including the Web), as promised to all those who provided much appreciated feedback on the previous post. I'm still a bit amazed by the feedback coming from the SEO community, because I really did not have SEO in mind. But I must admit I'm totally naive in this domain, and tend to stick to principles such as: do what you have to do, say what you have to say, make it clear and explicit, and let search engines do their job; quality content will float towards the top. And explicit semantic markup is certainly part of content quality. Very well ... but that was not my point at all. That said, any text is likely to be read and interpreted in many ways, and there is often more in it than its author was aware of. And actually, this is akin to what I am about today: the meaning of a text beyond its original context of production.

Language is an efficient and resilient distributed memory, where names and statements can live as long as they are used. And even when no longer used, they can nevertheless live forever if they are part of some story we keep telling, reading, commenting and translating, some text we are still able to decipher. We still use, or at least are able to make sense of, texts forged by ancient languages thousands of years ago, even if the things they used to name and speak about do not exist any more. Dead people, buildings and cities returned to the ground centuries ago, obsolete tools and ways of life, forgotten deities, concepts whose usage has faded away: the names of all those we nevertheless keep in the memory of languages - the texts. Some of us still read and make sense of ancient Greek and Latin, or even ancient Egyptian hieroglyphs. The physical support of this memory has changed over time, from oral transmission to bamboo, clay tablets, papyrus, manuscripts and printed books, analog and digital media of all kinds, today the cloud, and who knows what tomorrow. Insofar as such migrations were possible at all, we can trust the resilience of our language.

How do URIs fit in this story? URIs are a very recent kind of name, and RDF triples a new and peculiar way of weaving sentences. The people who forged the first of them are still around, and they were developed for a very specific technical context, the current architecture of the Web. Will they survive and mean something centuries from now? Do and will the billions of triples-statements-sentences we have written since the turn of the century make sense beyond the current context of the Web? Like Euclid's Elements, are they likely to live on and keep meaning for the very long term?

Let's make a thought experiment to figure it out. We are in 2115; the current Web architecture has been superseded since 2070 by some new technological infrastructure we can barely imagine in 2015, no more and no less than our grandmothers in 1915 could have imagined the current Web architecture. HTTP is obsolete, data is exchanged through whatever new protocol. Good old HTTP URIs have not dereferenced to anything for half a century. Do they still name something? Do the triples still make sense? Imagine you have saved all or part of the 2015 RDF content, and you still have software able to read it - just a text reader will do. Can you still make sense of it? Certainly, if you have a significant corpus. If you have the full download of the 2015 DBpedia or WorldCat, most of its content should be understandable, provided natural language has not changed too much. Hopefully this will be the case: in 2015 we read without difficulty texts written in 1915. And if you have saved a triple store infrastructure and software, you might still be able to query those data in SPARQL in 2115. Triples are triples, on the Web or outside it.
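To make the experiment concrete, here is a minimal sketch in Python, assuming the rdflib library and a toy fragment of such a saved dump (the triples themselves are illustrative). Even with every URI long dead as an address, the text still parses, and SPARQL still answers.

    # 2115: the saved triples are plain text, readable and queryable
    # offline, with no HTTP dereferencing at all. rdflib is assumed,
    # and the data is an illustrative fragment of a "2015 dump".
    from rdflib import Graph

    saved_2015_dump = """
    @prefix dbr: <http://dbpedia.org/resource/> .
    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    dbr:Rome rdfs:label "Rome" ;
             dbo:country dbr:Italy .
    """

    g = Graph()
    g.parse(data=saved_2015_dump, format="turtle")

    # The names no longer dereference, but woven into sentences
    # they still make sense, and can still be queried locally.
    results = g.query("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label WHERE {
            ?city dbo:country <http://dbpedia.org/resource/Italy> ;
                  rdfs:label ?label .
        }
    """)
    for row in results:
        print(row.label)   # -> Rome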

What lesson do we bring home from this trip to the future? Like any text, URIs and triples can survive and be meaningful well beyond the current Web infrastructure; they belong to the unfolding history of language and text. Of course, today's Web infrastructure allows easy navigation and querying, and the building of services on top of them. But when forging URIs and weaving triples, consider that beyond the current Web, what you write can live forever if it's worth it. Your text is likely to be translated into formats and languages, and read through supports and infrastructures, you just can't imagine today. Worth thinking about before publishing. Text never dies.

2015-04-11

From names to sentences, the Web language story

Conversation about text and names and how they are interwoven within the Web architexture is going on here and there. The more it goes on, the more I feel we need non-technical narratives and metaphors to help people get what the (Semantic) Web is all about. We have drowned them under technical talk, and schemas of layers of architecture, and protocols and data structures and ontologies and applications ... and the net result is that too many of them, smart people included, think only experts, engineers and geeks can grok it. So let me try one such - hopefully simple - narrative.

The story of the Web is just the story of language, continued by other means. Forging names to call things, and weaving those names into sentences and texts. On the Web, things have those weird names called URIs, but names all the same. As we have seen in a previous post, a name is, to begin with, a way to shout for and identify people and things in the night. On the Web, to call a thing by its URI-name you use some interface - a browser, a service, an application - and at this call something comes back through the interface. Well, the thing you have called does not actually come to you itself through the network, but you get something which is, hopefully, a good enough representation of the thing. The deep ontological question of the relationship between the name and what is named has been discussed for ages and will be discussed forever. The Web does not change that issue, does not solve it; it just provides new use cases and occasions to wonder about it. But this is not my point today.
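For the technically curious, here is roughly what such a call looks like under the hood: a sketch using Python's standard library, assuming a DBpedia URI as the example (servers differ in which representations they offer).

    # Calling a thing by its URI-name: an HTTP GET with content
    # negotiation. What comes back is not the thing, only one of its
    # representations. The URI is just an example.
    from urllib.request import Request, urlopen

    uri = "http://dbpedia.org/resource/Rome"   # the name we call

    # Ask for an RDF representation rather than an HTML page.
    call = Request(uri, headers={"Accept": "text/turtle"})
    with urlopen(call) as answer:
        representation = answer.read().decode("utf-8")

    print(representation[:300])   # a representation of Rome, not Rome itself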

In the first age of the Web, calling things was all you could do with those URI-names. You had the language ability of a two-year-old kid. You could say "orange" or "milk" when you were thirsty, and "dog" and "cat" and "car" and "sea" and "plane" when you saw or wanted one, and cry for everything else you could not express or the dumb Web would not understand. With no more sophisticated language constructs, you could nevertheless discover the wealth of the Web, through iterative, serendipitous calls. Because the courtesy of the Web is such that when you call for a thing, the answer often comes back with a bunch of other names you can call in turn (a hyperlink does just that: it enables you to call another name with a single click). You would bring back home things whose very existence you had not the faintest idea of a minute before. Remember the jubilation, the magic of your first Web navigation, twenty years ago? Like a kid laughing out loud on discovering the tremendous power of names to call things.
Today, in many (most) of our interactions with the Web, we are no longer aware of using names. We act with our fingertips, barely guessing that under the hood this is transformed into a client calling a server, or something on that server, by some name, and that many calls are made on the network to bring back what our fingers asked for. Only geeks and engineers know that. The youngest generations, who have not known the first age of the Web and interact only through such interfaces, are simply unaware of this whole affair of names. Did you say URL, Dad? What's that? It sounds so 90's ...

Now, when you grow older than two, you go beyond using names just to shout them in the face of the world: you begin to understand and build sentences yourself. That's a completely new experience, a new dimension of language unfolding. You link names together, you discover the texture, the power to understand and invent stories, and to ask and answer questions. You still use the same names, you are still interested in oranges, cats, dogs and cars, and all the thousands of things which are the children of naming. But you are now able to weave them together using verbs (predicates), qualifiers and quantifiers and logical coordination. You have become a language weaver.

And that's exactly and simply what the Semantic Web is about, and how it extends the previous Web. Just growing up and learning to weave sentences, tell stories, ask questions. But using the same URI-names as before. Any URI-name of the good old Web can become part of the Semantic Web. Just write a sentence - publish a triple using it as subject or object - and there you are.
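A minimal sketch of that gesture, assuming the Python rdflib library: the subject is a good old Web URI-name, while the ex: predicate and the post URI are made up for the example.

    # Take an existing URI-name and weave it into a sentence (a triple).
    # rdflib is assumed; ex: and the post URI are hypothetical.
    from rdflib import Graph, Namespace, URIRef

    dbr = Namespace("http://dbpedia.org/resource/")
    ex = Namespace("http://example.org/vocab/")   # hypothetical vocabulary

    g = Graph()
    g.bind("dbr", dbr)
    g.bind("ex", ex)
    # Subject, verb (predicate), object: a full sentence, not just a name.
    g.add((dbr.Rome, ex.praisedIn, URIRef("http://example.org/posts/weaving")))

    print(g.serialize(format="turtle"))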

2015-03-19

Text = Data + Style

We used to consider the Web as a hypertext, a smart and wonderful extension of the writing space. It is now rather viewed and used as a huge connected and distributed database. Search engines tend to become smart query interfaces for direct question-answering, rather than guides to the Web landscape. Writing-reading-browsing the hypertext, which was the main activity on the first Web, is more and more replaced by quick questions calling for quick answers in the form of data, preferably fitting the screen size of a mobile interface, and better still encapsulated in applications. Is this the slow death of the Web of Text, killed by the Web of Data?
For a data miner, text is just a primitive and cumbersome way to wrap data, from which the precious content has to be painfully extracted, like a gem from dumb bedrock. But if you are a writer, you might consider, the other way round, that data is just what you are left with when you have stripped a text of its rhythm, its flavor, the writer's eagerness to get in touch with the reader - in one word, its style. Why would one bother about style? +Theodora Karamanlis puts it nicely in her blog Scripta Manum, under the title "Writing: Where and How to begin".
You want readers to be able to differentiate you from amongst a group of other writers simply by looking at your style: the “this-is-me-and-this-is-what-I-think” medium of writing.
Writing on the Web is weaving, as we saw in the previous post, and your style in this space is the specific texture you give to it locally, in both the modern graphic sense and the old meaning of a way of weaving. The Web is indeed a unified (hyper)text space where anything can be woven to anything else, but this is achieved through many different local styles or textures. It would be a pity to see this diversity and wealth drowned in the flood of data.
We've learned these days that Google is working on a new kind of ranking, based on the quality of the data (facts, statements, claims) contained in pages. But do or will search engines include style in their ranking algorithms? Can they measure it, and take it into account in search results and personal recommendations, based on your style or the styles you seem to like? Some algorithms are able to identify writing styles, the same way others identify people and cats in images, or music performers. If I am to believe I Write Like, which I just tried on some posts of this blog, I'm supposed to write like I. Asimov or H.P. Lovecraft. Not sure how I should take that. But such technologies, applied to comparing blogs' styles, could yield interesting results, and maybe create new links that would not be discovered otherwise.
The bottom line for our data fanatics here could be that, after all, style is just another data layer. I'm not ready to buy that yet. I prefer the metaphor of style as texture. Data is so boring.

2015-03-11

... something borrowed, something blue

I already mentioned +Teodora Petkova in a recent post. Reading her blog, you'll maybe have, as I had several times, this "exactly ... that!" feeling you get when stumbling on words that look like they have been stolen from the tip of your tongue or pen. In particular, don't miss this piece, with its lovely bride's rhyme metaphor, to be applied to every text we write in order to weave it into the web of all texts.
Something old, something new, something borrowed, something blue
Something old ... how can one write without using something old, since what is older than the very words and language we use to write? And one should use them with due respect and full knowledge of their long history. Let's look at some of those venerable words. Children of the Northern European languages, web and weaving seem to come from the same ancient root, hence Weaving the Web is a kind of pleonasm. And text comes from the Latin texo, texere, textus, also meaning to weave, and cognate with the ancient Greek τέχνη, the ancestor of all our technics, technologies and architectures. In the Web technologies, the northern Germanic warp of words has been interwoven with the southern Latin woof, and each new text on the Web is a knot in this amazing tapestry. Our Web of texts is not as bad as I wrote a few years ago; with its patchy, fuzzy, furry and never-finished look, we love it and want to keep it that way.

Something new ... Text seems to be old-fashioned stuff these days; it's data and multimedia and applications all over the place. Even the Semantic Web has been redubbed the Web of Data by the W3C. And what if, after Linked Open Data (2007) and Linked Open Vocabularies (2011), we were to open 2015 as the year of Linked Open Text?

Something borrowed ... Teodora encapsulates all the above in the concept of intertextuality. And that one I definitely borrow and adopt (I just added it to the left menu), as well as the following, from another great piece.
As every text starts and ends in and with another text and we are never-ending stories reaching out to find possible continuations…
Something blue ... The blue of links, indeed; but to make Linked Open Text happen and deliver its potential, we certainly need more than one shade of blue. As Jean-Michel Maulpoix writes in his Histoire du bleu:
All this blue is not of the same ink.
One vaguely discerns in it floors and sorts of apartments, with their numbers, their families of various conditions, their wallpapers, their photographs, their holidays in the Alps and their terraces on the Atlantic, the ordinary satisfactions and the complications of their lives. The condition of blue is not the same according to the place it occupies in the scale of beings, of hues and of beliefs. The humblest make do with the lower floors, with their greasy papers and their graffiti: they hardly climb higher than the rooftops bristling with antennas. The happiest sometimes fly in an impeccable azure, and cast upon the human cities that beautiful panoramic gaze which once entertained the gods.
To fly that high, we indeed need to invent and use new shades of blue to paint the links between our texts, and the words where those links are anchored.

2015-03-02

Could computers invent language?

Artificial intelligence is something about which not a line has been written in these pages, in nearly two hundred posts over more than ten years. But today I feel like I should drop a couple of thoughts about it, after exchanges on Google+ around this post by +Gideon Rosenblatt and that one by +Bill Slawski, not to mention recent fears expressed by more famous people.
There are many definitions of artificial intelligence, and I will not quote or choose any. Likewise, I prefer to leave aside popular issues, such as whether computers can deal only with data and algorithms, or whether they can produce information or even knowledge, or whether they think and can, individually or collectively, accede to consciousness or even wisdom. All those terms are fuzzy enough to allow anyone to write anything and its contrary on such issues. Let's rather look at some concrete applications.
Pattern recognition is one of the great and most popular achievements of artificial intelligence. Programs are now able, with quite good performance, to transcribe speech into written language, identify music tracks, cluster similar news, identify people and cats in photographs, etc.
Automatic translation is also quite popular and, while it works not that badly for simple factual texts, it still has a hard time dealing with context to resolve ambiguity, and understanding puns and implicit references - all things generally associated with intelligent understanding of a text.
Question-answering is also making great progress, based on ever richer and more complex knowledge graphs, and on the translation of natural-language questions into formal queries.
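As a toy illustration of that last step (and nothing like what real question-answering engines do), here is a single hard-coded pattern turning a question into a SPARQL query; the DBpedia names are assumptions of the example.

    import re

    def question_to_sparql(question):
        # One hard-coded pattern; real systems parse far more than this.
        m = re.match(r"What is the capital of (\w+)\?", question)
        if m is None:
            return None
        # The naive name-to-URI mapping below is an oversimplification.
        return (
            "SELECT ?capital WHERE { "
            f"<http://dbpedia.org/resource/{m.group(1)}> "
            "<http://dbpedia.org/ontology/capital> ?capital }"
        )

    print(question_to_sparql("What is the capital of Italy?"))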
No doubt algorithms will continue to improve in those domains, with many useful applications, and some related and important issues regarding privacy and the delegation of decisions to algorithms.

All the above tasks deal more or less with the ability of computers to process our languages successfully. But - and this is where I have been bound from the start - there is a fundamental capacity of human intelligence which, as far as I know, has not even begun to be mimicked by algorithms: the capacity to invent language. It has been much discussed, since Wittgenstein, whether a private language is possible or not, but there is no discussion that language has been, and still is, built collectively through a process of continuous collective invention. Anyone can invent a new word or a new linguistic form; whether it will be integrated into the language commons depends on many criteria, akin to the ones enabling a new species to expand and survive, or disappear. This is the way our languages constantly evolve and adapt to the changing world of our communication and discourse needs. Could computers mimic such a process, take part in it, and even expand it further than humans? Could algorithms produce new and relevant words, smoothly integrated into the existing language, to name concepts not yet discovered or named? In short, can computers take part in the continuous invention of language, and not only make smart use of the existing one?
Such a perspective would be fascinating indeed, and certainly scary, insofar as machines collectively inventing such language extensions would not necessarily share them with humans - and even if they did, humans would not necessarily be able to understand them.

Whether such an evolution is possible at all, or in a foreseeable future, is a good question. Whether we should hope for it and work to let it happen, or fear and prevent it, is a yet more interesting one. But at the very least, these are questions we can technically specify, making them much more valuable for the assessment and definition of artificial intelligence than vague digressions on whether computers can think, have knowledge, or become conscious. We don't even really know what the latter means for humans, our shared language being the closest proxy we have for whatever is going on in our brainware. So let's assess the progress of artificial intelligence by the same criteria we generally use to assess human intelligence: its ability to deal with language, from the plain naming of things to the invention of new concepts.

2015-02-26

Statements are only statements

A few days ago, in the comments on this post by +Teodora Petkova on Google+, I promised +Aaron Bradley a post explaining why I am uneasy with the reference to things in Tim Berners-Lee's reference document defining (in 2006) Linked Data. The challenge was to make it readable by seven-year-old kids, or marketers, but I'm not sure the following meets this requirement.

When Google launched its Knowledge Graph (in 2012) with the tagline things, not strings, it was not much more than the principles of Linked Data as exposed in the above-said document six years before, but implemented as a Google enclosure of mostly public source data, with neither an API nor even public, reusable URIs. I ranted about that here at the time, and nothing seems to have changed since in that respect.
But something important I missed at the time is a subtle drift between TBL's prose and Google's. The former speaks about things and information about those things. The latter also starts by using the term information, but rapidly switches to objects and facts.
[The Knowledge Graph] currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects.
The document uses "thing", "entity" and "object" in various places as apparently broad synonyms, conveying (maybe unwillingly) the (very naive) notion that the Knowledge Graph is a neat projection into data of well-defined "real-world" things-entities-objects, and of proven (true) facts about those. This impression is reinforced by the use of expressions such as "Find the right thing". And actually, that's how most people are ready to buy it: "Don't be evil" implies "Don't lie, just facts". In a nutshell, if you want to know (true, proven, quality-checked) facts about things, just ask Google. It used to be just ask Wikipedia, but since the Knowledge Graph taps into Wikipedia, it inherits the trust in its source. But similarly naive presentations can be found here and there, uttered by enthusiastic Linked Data supporters. Granted, TBL's discourse avoids references to "facts", but it does not close the door, and through this opening a pervasive neo-Platonic view of the world has rushed in: there are things and facts out there, just represent them on the Web using URIs and RDF, et voilà. The DBpedia Knowledge Base description contains typical sentences blurring the ontological status of what is described.
All these [DBpedia] versions together describe 38.3 million things, out of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia.
It's left to everyone's guess to figure out what "existence in the English version" can mean for a thing. What should such documents say, instead of "things" and "facts", to avoid such confusion? Simply what they are: databases of statements using names (URIs) and sentences (RDF triples) which just copy, translate, adapt - in one word, re-present on the Web - statements already present in documents and data, in a variety of more or less natural, structured, formal, shared, idiomatic languages. As often stressed here (for five years at least), this representation is just another translation.
And, as for any kind of statement in any language, to figure out whether you can trust them or not, you should be able to track their provenance, the context and time of their utterance. That's, for example, how Wikidata is intended to work. Look at the image below: nothing like a real-world thing or fact is mentioned, but a statement with its claim and context.
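Sketched as a plain data structure, such a statement looks roughly like this; the identifiers follow Wikidata's conventions (Q42 = Douglas Adams, P69 = educated at), but the reference details are illustrative.

    # Not a bare "fact about a thing", but a claim wrapped in its context.
    statement = {
        "subject": "Q42",              # Douglas Adams
        "claim": {
            "property": "P69",         # educated at
            "value": "Q691283",        # St John's College
        },
        "references": [{
            "stated_in": "some external source",   # illustrative
            "retrieved": "2015-02-26",
        }],
        "rank": "normal",
    }
    # Trust is assessed by tracking the reference,
    # not by taking the claim as a real-world fact.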
The question of the relationship of names and statements with any real-world referents is a deep question, open among philosophers for ages, and it should certainly remain open. In any case the Web, Linked Data and the Knowledge Graph do not, will not, and should not - insidiously, or even with no evil in mind - pretend to close it. Those technologies just provide incredibly efficient ways to exchange, link, access and share statements, based on the Web architecture and a minimalist standard grammar. Which is indeed a great achievement - no less, but no more. At the end of the day, data are only data, statements are only statements.

2015-02-23

Common names, proper usage

What follows might be, like previous posts, relevant to the raging debate in and around the W3C Shapes Working Group. If you don't care too much about Latin, Greek, French, German, etymology, translation and languages at large, you can go straight to the last paragraph. But I trust my faithful readers (whoever they are) to follow me through the long preliminary linguistic meanders.

A while ago I pointed at the enclosure of common names as trademarks. Maybe I should have written common nouns. But in French (my native language), there is a single word, nom, to translate both noun and name, all of them cognates of the Latin nomen, the Greek ὄνομα, and many more avatars of the same Indo-European root. In French grammar you say "nom commun" for "common noun" and "nom propre" for "proper noun", and a French native speaker is likely to translate them in English as "common name" and "proper name", both ambiguous out of context. And my purpose today is indeed to look at what it can mean for names to be common or proper, beyond what it means for grammatical nouns.
Let's look into Latin again, where communis and proprius, as well as their ancient Greek equivalents κοινός and ἴδιος, have roughly the semantic scope they have kept in French and English. Together they split the world into what belongs to the commons and what is proprietary or private. Beyond and before their use in grammar to denote universals and particulars, further meanings have built upon the good or bad characteristics associated with each term. Typically, "common" will be used as a derogatory qualifier for whatever belongs to the vulgum pecus, those common people who do not behave, think or speak properly. The French "propre" goes even further down this path of value judgment, also meaning "clean", with disambiguation by position ("c'est ma propre maison" = "it's my own house" vs "sa maison est propre" = "her house is clean"). Such extensions seem indeed characteristic of a language controlled by some aristocracy. It's worth noticing that the English "own" and its German cognate "eigen" do not seem to have suffered similar semantic drifts.
Sticking to the original meaning, and forgetting the interpretations of either grammar or aristocracy, common names would simply be names belonging to the commons. Which is true, if you think about it, of just any name. A name with no community (or commonality) would be useless, and actually barely a name at all - just a string with no shared usage and no agreed-upon denotation. Under such a definition, even proper nouns are common names. From a grammatical viewpoint, "Roma" is a proper noun, but it's common to all the people using it to denote the capital of Italy. To make it short: all names belong to the commons, otherwise they don't name anything at all.
The above analysis does not apply only to natural-language names (aka nouns), but also to all those technical names handled in the internal languages of our information systems, the names used by machines to call each other in the dark (see previous post) and take actions. URIs, addresses, names of objects and classes ... if those were not common names, we would have no open Web, and no open source code with reusable libraries.
But those common names, when used and interpreted by software, behave internally at run time as proper names, by all meanings of "proper". Each of them calls a well-defined individual object, method, or whatever piece of executable code. A URI sent through the HTTP protocol ends up calling, by their internal names, specific pieces of data on one or more servers, each of them running its own, proper, often proprietary code with its idiosyncratic functional semantics.
Otherwise said: if the declarative semantics of a technical name (the description of what it denotes) belongs to the commons, its performative semantics (what it does when called) is proper to the system in which it is used, and to the conditions at run time.
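A small sketch of the distinction, with hypothetical names throughout: the declaration is common to everyone who reads it, while the route is proper to the system that runs it.

    # Declarative semantics, common: a published description of what
    # the name denotes, the same for every reader.
    DECLARED = """
    @prefix ex: <http://example.org/ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:order42 a ex:Order ; rdfs:label "Order #42" .
    """

    # Performative semantics, proper: what this particular system
    # actually does, at run time, when the name is called.
    def fetch_from_legacy_database(key):
        # Stand-in for this server's own, often proprietary, code.
        return {"id": key, "status": "shipped"}

    ROUTES = {
        "http://example.org/ns#order42": lambda: fetch_from_legacy_database("ORD-42"),
    }

    def call(name):
        # Another server could publish the very same declaration
        # above and still behave quite differently here.
        return ROUTES[name]()

    print(call("http://example.org/ns#order42"))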

How is that relevant to the W3C Shapes debate? What this group is (maybe) seeking, or should seek, is actually a (standard) way to describe proper performative semantics for systems using RDF data. On the DC-Architecture list, +Holger Knublauch was complaining a few days ago:
Yet, there used to be a notion of a Semantic Web, in which people were able to publish ontologies together with shared semantics. On this list and also the WG it seems that this has come out of fashion, and everyone seems "obsessed" with the ability to violate the published semantics.
Violate the published semantics? Well, no: it's just about describing how the common semantics behave properly in my system. But whether that can be achieved through yet another declarative language, or through some interpretation of existing ones, without blurring the RDF landscape a bit more, is another story.
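To illustrate the point (and this is only a toy check in Python with rdflib, not whatever language the Working Group may end up with), here is a "proper" requirement stated over common names, all of them hypothetical.

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/ns#")

    def violations(g):
        # My system properly requires an ex:email for every ex:Person.
        # The common, published semantics of ex:Person are untouched;
        # this only says how that common name must behave in this system.
        for person in g.subjects(RDF.type, EX.Person):
            if (person, EX.email, None) not in g:
                yield person

    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/ns#> .
        ex:alice a ex:Person ; ex:email "alice@example.org" .
        ex:bob a ex:Person .
    """, format="turtle")

    print(list(violations(g)))   # -> only ex:bob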