2015-03-19

Text = Data + Style

We used to consider the Web as a hypertext, a smart and wonderful extension of the writing space. It is now rather viewed and used as a huge connected and distributed database. Search engines tend to become smart query interfaces for direct question answering, rather than guides to the Web landscape. Writing-reading-browsing the hypertext, which was the main activity on the first Web, is more and more replaced by quick questions calling for quick answers in the form of data, preferably fitting the screen of a mobile interface, and better yet encapsulated in applications. Is this the slow death of the Web of Text, killed by the Web of Data?
For a data miner, text is just a primitive and cumbersome way to wrap data, from which the precious content has to be painfully extracted, like a gem from a dumb bedrock. But if you are a writer, you might consider, the other way round, that data is just what you are left with once you have stripped the text of its rhythm, its flavor, the writer's eagerness to get in touch with the reader, in one word, its style. Why would one bother about style? +Theodora Karamanlis puts it nicely in her blog Scripta Manum under the title "Writing: Where and How to begin".
You want readers to be able to differentiate you from amongst a group of other writers simply by looking at your  style: the “this-is-me-and-this-is-what-I-think” medium of writing. 
Writing on the Web is weaving, as we have seen in the previous post, and your style in this space is the specific texture you give to it locally, in both the modern graphical sense and the old meaning of a way of weaving. The Web is indeed a unified (hyper)text space where anything can be woven to anything else, but this is achieved through many different local styles or textures. It would be a pity to see this diversity and wealth drowned in the flood of data.
We've learnt these days that Google is working on a new kind of ranking, based on the quality of data (facts, statements, claims) contained in pages. But do or will search engines include style in their ranking algorithms? Can they measure it, and take it into account in search results and personal recommendations, based on your style or the styles you seem to like? Some algorithms are able to identify writing styles the same way others identify people and cats in images, or music performers. If I am to believe I Write Like, which I just tried on some posts of this blog, I'm supposed to write like I. Asimov or H.P. Lovecraft. Not sure how I should take that. But such technologies applied to compare blogs' styles could yield interesting results, and maybe create new links that would not be discovered otherwise.
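For the record, the kind of comparison such tools perform can be sketched in a few lines. Below is a crude character-trigram profile compared by cosine similarity, a classic stylometric trick; the texts are placeholders, and I have no idea whether I Write Like does anything of the sort.

```python
# A rough stylometry sketch: character trigram frequency profiles compared
# by cosine similarity. Placeholder texts; not what any real service uses.
from collections import Counter
from math import sqrt

def trigram_profile(text: str) -> Counter:
    """Count character trigrams, a crude fingerprint of writing style."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p: Counter, q: Counter) -> float:
    """Cosine similarity between two profiles (1.0 means identical)."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

post_a = "Writing on the Web is weaving, and your style is the texture you give to it."
post_b = "Data is just what is left of a text once stripped of its rhythm and flavor."

print(cosine_similarity(trigram_profile(post_a), trigram_profile(post_b)))
```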
The bottom line for our data fanatics here could be that, after all, style is just another data layer. I'm not yet ready to buy that. I prefer the metaphor of style as a texture. Data is so boring.

2015-03-11

... something borrowed, something blue

I already mentioned +Teodora Petkova in a recent post. Reading her blog, you'll maybe have, as I had several times, this "exactly ... that!" feeling you get when stumbling on words that look as if they had been stolen from the tip of your tongue or pen. In particular, don't miss this piece, with its lovely bride's rhyme metaphor, to be applied to every text we write in order to weave it into the web of all texts.
Something old, something new, something borrowed, something blue
Something old ... how can one write without using something old, since what is older than the very words and language we use to write? And one should use them with due respect and full knowledge of their long history. Let's look at some of those venerable words. Children of the Northern European languages, web and weaving seem to come from the same ancient root, hence Weaving the Web is a kind of pleonasm. And text comes from the Latin texo, texere, textus, also meaning to weave, and cognate to the ancient Greek τέχνη, the ancestor of all our technics, technologies and architectures. In Web technologies, the northern Germanic warp of words has been interwoven with the southern Latin woof, and each new text on the Web is a knot in this amazing tapestry. Our Web of texts is not as bad as I wrote a few years ago, and with its patchy, fuzzy, furry and never-finished look, we love it and want to keep it that way.

Something new ... Text seems to be old, out-of-fashion stuff these days; it's data and multimedia and applications all over the place. Even the Semantic Web has been redubbed Web of Data by the W3C. And what if, after Linked Open Data (2007) and Linked Open Vocabularies (2011), we were to open in 2015 the year of Linked Open Text?

Something borrowed ... Teodora encapsulates all the above with the concept of intertextuality. And that one I definitely borrow and adopt (just added it to the left menu), as well as the following from another great piece.
As every text starts and ends in and with another text and we are never-ending stories reaching out to find possible continuations…
Something blue ... The blue of links indeed, but to make Linked Open Text happen and deliver its potential, we certainly need more than one shade of blue. As Jean-Michel Maulpoix writes in his Histoire du bleu ... All this blue is not of the same ink.
Tout ce bleu n’est pas de même encre.
One vaguely makes out floors and kinds of apartments in it, with their numbers, their families of various conditions, their wallpapers, their photographs, their holidays in the Alps and their terraces on the Atlantic, the ordinary satisfactions and the complications of their lives. The condition of blue is not the same depending on the place it occupies in the scale of beings, of shades and of beliefs. The humblest make do with the lower floors, with their greasy papers and their graffiti: they hardly climb higher than the rooftops bristling with antennas. The happiest sometimes fly in an impeccable azure and cast upon the human cities that beautiful panoramic gaze which once entertained the gods.
To fly that high, we indeed need to invent and use new shades of blue to paint the links between our texts, and the words where those links are anchored.

2015-03-02

Could computers invent language?

Artificial intelligence is something about which not a line has been written in these pages, in nearly two hundred posts spanning more than ten years. But I feel today like I should drop a couple of thoughts about it, after exchanges on Google+ around this post by +Gideon Rosenblatt and that one by +Bill Slawski, not to mention recent fears expressed by more famous people.
There are many definitions of artificial intelligence, and I will not quote or choose any. Likewise, I prefer to leave alone popular issues such as whether computers can deal only with data and algorithms, or whether they can produce information or even knowledge, or whether they think and can individually or collectively attain consciousness or even wisdom. All those terms are fuzzy enough to allow anyone to write anything and its contrary on such issues. Let's rather look at some concrete applications.
Pattern recognition is one of the great and most popular achievements of artificial intelligence. Programs are now able, with quite good performance, to transcribe speech into written text, identify music tracks, cluster similar news stories, identify people and cats in photographs, etc.
Automatic translation is also quite popular and, while not working too badly for simple factual texts, still has a hard time using context to resolve ambiguity and to understand puns and implicit references, all things generally associated with intelligent understanding of a text.
Question answering is also making great progress, based on ever richer and more complex knowledge graphs, and on the translation of natural language questions into formal queries.
No doubt algorithms will continue to improve in those domains, with many useful applications, and some important related issues regarding privacy and the delegation of decisions to algorithms.

All the above tasks deal more or less with the ability of computers to process our languages successfully. But, and this is where I have been bound from the start, there is a fundamental capacity of human intelligence which, as far as I know, has not even begun to be mimicked by algorithms. It's the capacity to invent language. Whether a private language is possible or not has been largely discussed since Wittgenstein, but there is no discussion that language has been, and still is, built through a continuous process of collective invention. Anyone can invent a new word or a new linguistic form; whether it will be integrated into the language commons depends on many criteria, akin to the ones enabling a new species to expand and survive, or disappear. This is the way our languages constantly evolve and adapt to the changing world of our communication and discourse needs. Could computers mimic such a process, take part in it, and even expand it further than humans? Could algorithms produce new and relevant words, smoothly integrated into the existing language, to name concepts not yet discovered or named? In short, are computers able to take part in the continuous invention of language, and not only make smart use of the existing one?
Such a perspective would indeed be fascinating, and certainly scary, insofar as machines collectively inventing such language extensions would not necessarily share them with humans, and even if they did, humans would not necessarily be able to understand them.

Whether such an evolution is possible at all, or in a foreseeable future, is a good question. Whether we should hope for it and work to let it happen, or should fear and prevent it, is a yet more interesting one. But at the very least, those are questions we can technically specify, making them much more valuable for the assessment and definition of artificial intelligence than vague digressions on whether computers can think, have knowledge or become conscious. We don't even really know what the latter means for humans, our shared language being the closest proxy we have for whatever is going on in our brainware. So let's assess the progress of artificial intelligence by the same criteria we generally use to assess human intelligence: its ability to deal with language, from the plain naming of things to the invention of new concepts.

2015-02-26

Statements are only statements

A few days ago, in the comments of this post by +Teodora Petkova on Google+, I promised +Aaron Bradley a post explaining why I am uneasy with the reference to things in Tim Berners-Lee's reference document defining (in 2006) Linked Data. The challenge was to make it readable by seven-year-old kids or marketers, but I'm not sure the following meets this requirement.

When Google launched its Knowledge Graph (in 2012) with the tagline things, not strings, it was not much more than the principles of Linked Data as exposed in the above-mentioned document six years earlier, but implemented as a Google enclosure of mostly public-source data, with neither an API nor even public, reusable URIs. I ranted about that here, and nothing seems to have changed since, for that matter.
But something important I missed at the time is a subtle drift between TBL's prose and Google's. The former speaks about things and information about those things. The latter also starts by using the term information, but rapidly switches to objects and facts.
[The Knowledge Graph] currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects.
The document uses "thing", "entity" and "object" at various places as apparently broad synonyms, conveying (maybe unwillingly) the (very naive) notion that the Knowledge Graph stands as a neat projection into data of "real-world" well-defined things-entities-objects and of proven (true) facts about them. An impression reinforced by the use of expressions such as "Find the right thing". And actually, that's how most people are ready to buy it: "Don't be evil" implies "Don't lie, just facts". In a nutshell, if you want to know (true, proven, quality-checked) facts about things, just ask Google. It used to be just ask Wikipedia, but since the Knowledge Graph taps into Wikipedia, it inherits the trust placed in its source. But similarly naive presentations can be found here and there, uttered by enthusiastic Linked Data supporters. Granted, TBL's discourse avoids reference to "facts", but it does not close the door, and through this opening a pervasive neo-Platonician view of the world has rushed in. There are things and facts out there, just represent them on the Web using URIs and RDF, et voilà. The DBpedia Knowledge Base description contains typical sentences of this kind, blurring the ontological status of what is described.
All these [DBpedia] versions together describe 38.3 million things, out of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia.
It's left to everyone's guess what "existence in the English version" can mean for a thing. What should such documents say instead of "things" and "facts" to avoid such confusion? Simply what they are: databases of statements using names (URIs) and sentences (RDF triples), which just copy, translate, adapt, in one word re-present on the Web statements already present in documents and data, in a variety of more or less natural, structured, formal, shared, idiomatic languages. As often stressed here (for five years at least), this representation is just another translation.
And, as for any kind of statement in any language, to figure out whether you can trust them or not, you should be able to track their provenance, the context and time of their utterance. That's, for example, how Wikidata is intended to work. Look at the image below: nothing like a real-world thing or fact is mentioned, but a statement with its claim and context.
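To make the same point in RDF terms, here is a minimal sketch of a statement recorded along with its provenance, using the rdflib library; the URIs, the population figure and the source are all made up, and the reification vocabulary used below is just one of several possible ways to say "who stated this, and when".

```python
# A claim (an RDF triple, a sentence made of names) plus a statement about
# the statement: where it was quoted from and when. All URIs and figures
# below are made up for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("ex", EX)
g.bind("prov", PROV)

# The claim itself: names arranged in a sentence, nothing more.
claim = (EX.Rome, EX.population, Literal(2863322, datatype=XSD.integer))
g.add(claim)

# The statement about the statement: its provenance and time of utterance.
stmt = EX.statement1
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, claim[0]))
g.add((stmt, RDF.predicate, claim[1]))
g.add((stmt, RDF.object, claim[2]))
g.add((stmt, PROV.wasQuotedFrom, URIRef("http://example.org/census-2014")))
g.add((stmt, PROV.generatedAtTime, Literal("2014-12-31", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```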
The question of the relationship of names and statements to any real-world referents is a deep question, kept open by philosophers for ages, and one which should certainly remain open. In any case, the Web, Linked Data and the Knowledge Graph do not, will not, and should not, insidiously or even with no evil in mind, pretend to close it. Those technologies just provide incredibly efficient ways to exchange, link, access and share statements, based on the Web architecture and a minimalist standard grammar. Which is indeed a great achievement, no less, but no more. At the end of the day, data are only data, statements are only statements.

2015-02-23

Common names, proper usage

What follows might be, as previous posts were, relevant to the raging debate in and around the W3C Shapes Working Group. If you don't care too much about Latin, Greek, French, German, etymology, translation and languages at large, you can go straight to the last paragraph. But I trust my faithful readers (whoever they are) to follow me through the long preliminary linguistic meanders.

A while ago I pointed at the enclosure of common names as trademarks. Maybe I should have written common nouns. But in French (my native language), there is a single word, nom, to translate both noun and name, all of them cognates of the Latin nomen, the Greek ὄνομα, and many more avatars of the same Indo-European root. In French grammar you say "nom commun" for "common noun" and "nom propre" for "proper noun", and a French native speaker is likely to translate them into English as "common name" and "proper name", both ambiguous out of context. And my purpose today is indeed to look at what it can mean for names to be common or proper, beyond what it means for grammatical nouns.
Let's look into Latin again, where communis and proprius, as well as their ancient Greek equivalents κοινός and ἴδιος, have roughly the semantic scope they have kept in French and English. Together they split the world into what belongs to the commons and what is proprietary or private. Beyond and before their use in grammar to denote universals and particulars, further meanings have been built upon the good or bad characteristics associated with each term. Typically, "common" will be used as a derogatory qualifier for whatever belongs to the vulgum pecus, those common people who do not behave, think or speak properly. The French "propre" goes even further down this derogatory path to mean "clean", with disambiguation by position ("c'est ma propre maison" = "it's my own house" vs "sa maison est propre" = "her house is clean"). Such extensions seem indeed characteristic of a language controlled by some aristocracy. It's worth noting that the English "own" and its German cognate "eigen" do not seem to have suffered similar semantic drifts.
Sticking to the original meaning and forgetting the interpretations of either grammar or aristocracy, common names would be simply names belonging to the commons. Which is true, if you think about it, for just any name. A name with no community (or communality) would be useless, and actually barely a name, just a string with no shared usage and agreed-upon denotation. Under such a definition, even proper nouns are common names. From a grammatical viewpoint, "Roma" is a proper noun, but it's common to all people using it to denote the capital of Italy. To make it short, all names belong to the commons, otherwise they don't name anything at all.
The above analysis does not apply only to natural language names (aka nouns), but also to all those technical names handled in the internal languages of our information systems, the names used by machines to call each other in the dark (see the previous post) and take action. URIs, addresses, object and class names ... if those were not common names, we would have no open Web, and no open-source code with reusable libraries.
But those common names, when used and interpreted by software, behave internally at run time as proper names, in every sense of "proper". Each of them calls a well-defined individual object, method, or other piece of executable code. A URI sent through the HTTP protocol eventually calls, by their internal names, specific pieces of data on one or more servers, each of them run by its own, proper, often proprietary code with its idiosyncratic functional semantics.
In other words, if the declarative semantics of a technical name (the description of what it denotes) belongs to the commons, its performative semantics (what it does when called) is proper to the system in which it is used, and to the conditions at run time.
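A small sketch of that distinction, assuming the Python requests library and taking a DBpedia URI as an example of a common name: anyone can reuse the name, but what actually happens when it is called, redirects, formats, availability, is proper to the server answering at that moment.

```python
# Calling a common name: the URI is shared and reusable, but the behaviour
# observed here (redirects, content types, status codes) is proper to the
# server and to run-time conditions. The URI is just an illustrative example.
import requests

name = "http://dbpedia.org/resource/Rome"  # a common name anyone may reuse

html_view = requests.get(name, headers={"Accept": "text/html"}, timeout=10)
data_view = requests.get(name, headers={"Accept": "text/turtle"}, timeout=10)

print(html_view.url, html_view.status_code, html_view.headers.get("Content-Type"))
print(data_view.url, data_view.status_code, data_view.headers.get("Content-Type"))
```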

How is that relevant to the W3C Shapes debate? What this group is (maybe) seeking, or should seek, is actually a (standard) way to describe proper performative semantics for systems using RDF data. On the DC-Architecture list, +Holger Knublauch was complaining a few days ago.
Yet, there used to be a notion of a Semantic Web, in which people were able to publish ontologies together with shared semantics. On this list and also the WG it seems that this has come out of fashion, and everyone seems "obsessed" with the ability to violate the published semantics.
Violate the published semantics? Well, no, it's just about describing how the common semantics behave properly in my system. But whether that can be achieved through yet another declarative language, or through some interpretation of existing ones, without blurring the RDF landscape a bit more, is another story.
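Just to make "properly in my system" concrete, here is the kind of local check I have in mind, hand-rolled over an rdflib graph with made-up data; nothing here is the Working Group's language, just a toy constraint requiring every ex:Person in my system to carry an ex:name, whatever the published ontology allows.

```python
# A toy "shape" check proper to my system: every ex:Person must have an
# ex:name. The common, declarative semantics of ex:Person says nothing about
# this; the constraint below exists only here, at run time. Made-up data.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")

g = Graph()
g.parse(data="""
    @prefix ex: <http://example.org/> .
    ex:alice a ex:Person ; ex:name "Alice" .
    ex:bob   a ex:Person .
""", format="turtle")

violations = [
    person
    for person in g.subjects(RDF.type, EX.Person)
    if (person, EX.name, None) not in g
]

for person in violations:
    print(f"Constraint violated: {person} has no ex:name")
```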

2015-02-17

You need names on the Web, it's dark in there.

The Chinese character 名 (name), which we saw in the previous post as the mother of all things, has an interesting origin. It's composed of the characters 夕 (night, symbolized by a crescent moon) and 口 (an open mouth). The key to this mysterious association is that you need a name to call someone, or to identify yourself, in the dark of night. In daylight, you don't really need to know the name of your interlocutor to recognize each other and engage in conversation. You don't need the names of things to find and handle them.

Interaction through information systems, and singularly on the Web, is a conversation in the darkest of nights. You can't see your interlocutors, you can't wave or bow to them, you don't see what you are looking for either, and the system does not see you. So you need names everywhere. You need names to enter the system, to log in, to send messages. You need to know names to connect to people on the social web. You need to know the name of what you are searching for to ask a search engine. One can argue that all of this is rapidly changing, with identification using your finger or eyeprint, and connection to stuff or people using icons and various fancy non-textual interfaces. But under the hood, the system will still exchange ids, keys, addresses, all those avatars of names used by machines. Even if our online experience gets closer and closer to daylight conversation, poor machines will keep shooting names to each other for a long time across the dark of the Web.
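Just to make the point concrete, here is a glimpse of the names machines trade for a single page view, sketched with the Python standard library; example.org is a placeholder host.

```python
# Even one page view is machines calling each other by name in the dark:
# a hostname is traded for an IP address, then a request full of names
# (host, path, user agent) goes out, and names come back in the headers.
import socket
import urllib.request

host = "example.org"  # placeholder host
print("name ->", socket.gethostbyname(host))  # the hostname resolved to an address

req = urllib.request.Request(f"http://{host}/", headers={"User-Agent": "name-demo"})
with urllib.request.urlopen(req, timeout=10) as resp:
    print("final URL:", resp.geturl())
    print("server name:", resp.headers.get("Server"))
```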

2015-02-07

名可名,非常名

My conversation with good old 老子 is a never-ending story, and I had to revisit him with the untranslatables paradigm in mind. I discovered long ago the extreme difficulty of translating Chinese characters, singularly in ancient writings, through the excellent introduction I already mentioned here some years ago, the "Idiot chinois" by Kyril Ryjik. The book had long been out of print, my copy was lost in a former life, and fortunately, a few years ago, I stumbled on some obscure blog upon a PDF copy that I was preciously keeping safe ... but I can now forget about all that. After thirty years of dark ages, L'Idiot Chinois is now republished, and this new edition should land on my bookshelves any time soon ...
The infamous and cryptic first chapter of the 道德經 would certainly be shortlisted in any challenge for the best untranslatables ever. It is an example Ryjik presents precisely because it's both too well known and too much translated, and certainly deeply misunderstood by most Western translators.
Here is the first part which, even if you don't read Chinese, will strike you by the rhythm and sheer graphical refinement of its 24 characters. Note that the character 名 (míng, "name") is repeated five times, a hint that this story is mainly about names and naming.

道可道,非常道
名可名,非常名
無名天地之始
有名萬物之母

Ryjik holds that all but a few Western translations and interpretations project onto 道 a transcendental interpretation which does not make sense in the historical, political and cultural context where this text was produced. This is still the case of many available translations, in which the Dao has too much the look and feel of our Western monotheist God. If nothing else, the initial caps everywhere are suspicious; there is no upper case in Chinese. 道 should certainly be taken with a more mundane meaning: the way the world is going, and which human beings should try to follow, individually and collectively, in order to live in harmony with the general flow. Only physics, no metaphysics.
With this in mind, Ryjik posits that the negative 非 in the first sentence should certainly be read as a determinant of 常 (constant, unchanging, regular, in one word steady), rather than of the whole group 常道.
In other words, where most translators read 非(常道), not (steady way), one should rather read (非常)道, (not steady) way. Which makes the whole sentence read something like (a) way really way is not a steady way. In other words: if you want to conform your way to the way (of the world at large), you have to adapt and change (as the world does). In the historical context, Ryjik holds that this is a moral and political recommendation not to stick to a rigid application of ancient rules when the situation is ever-changing. But this is a general consideration, just put there to introduce the main point of the story: the role of names.
Reading 名可名,非常名 in the same spirit yields name really name is not a steady name. Since things, as the world flows, are ever-changing, the names you give to things are also bound to change to keep their accuracy. And in this spirit I just changed the title of this blog ...
As for the following two sentences, which seem more mysterious, I've not been fully convinced by any translation so far, even Ryjik's. I'm pushed towards proposing my own translation by a beautiful edition entitled "La Danse de l'Encre", illustrated by Lassaâd Metoui, a Tunisian calligrapher. Thomas Golsenne writes in the introduction (in French, my translation)
"To read the Tao Te King against the grain, out of context is not only a right granted to the reader, it's a sort of duty  ... Understanding or translating [it] "faithfully" does not make any sense, because there is nothing to be faithful to, nothing but emptiness"
So be it, here goes my own unfaithful version of the two following sentences

無名天地之始  : there is no name at the origin of the universe
有名萬物之母  : having a name is the mother of all things

Which I read: the world as a whole, 天地 (sky and earth), exists before and beyond any name, and does not need any name to exist; but with names comes the separation into things, this and not-this, one, two and the ten thousand beings, as said further on in chapter 42: 道生一,一生二,二生三,三生萬物. Dao is father of one, one is father of two, two is father of three, three is father of the multitude of beings.
I'm not sure we need a subject other than 無名 and 有名 in those two sentences, a subject which would implicitly be 道, as most translations have it, like "Without name the Dao is the origin of the Universe", etc. ... here come the Holy Ghost, the Logos and the heavy monotheist capitalization. But the dao has nothing to do with the Holy Ghost. There is no metaphysics in the dao, only physics.
This is actually somehow akin to the (too noisy) recent thesis of Markus Gabriel, "Warum es die Welt nicht gibt" (Why the World Does Not Exist). Things exist insofar as they are named, but the world cannot be named as a separate entity, because there is nothing from which it could be separated.

Amazingly enough, there is no entry for name in the Dictionary of Untranslatables. Not even a small entry in the index. This is certainly food for thought to expand on in a future post.