2018-09-28

The stubborn arithmetic of cousins

When you get involved in genealogy, sooner or later, after days, weeks, months or years of patient research (depending on how lucky and obstinate you are), you discover that your best friend, your boss, the old lady next door, your favourite writer or singer, your loyal enemy or the latest serial killer are all, to some degree, your cousins.
Actually you knew this had to be true, in theory at least. All humans have common ancestors, somewhere in the past. But it's a completely different story to be able to identify and name them, and figure out how long ago they lived. It might be quite easy if you and I belong to families who have kept their genealogical records over centuries, and can proudly show a lineage tracing back to Charlemagne. No big deal, actually, since anyone tracing their ancestry that far back is likely to be in the same situation. According to the genealogical database Roglo, the identified descendants of Charlemagne number more than 1,500,000, more than 20% of a database of about 7,500,000 people. But if your ancestors are, like mine, obscure and illiterate peasants, we are likely to run into the lack of documents beyond the last few centuries of church and civil registries, and are lucky if we can reach as far back as around 1600 for some more or less reliable information about a handful of ancestors. This seems quite far away, but it's only about a dozen generations, which means a few thousand people.
So, how far back do you have to go to find common ancestors with your best friend? Let's have a look at the harsh reality of numbers. The number of your ancestors at generation n is 2^n. You have two parents, four grandparents, and so on. Counting thirty years for each generation (give or take a few), ten generations span three centuries. At the tenth generation you count 2^10 ancestors, which is about one thousand. Being born in the 1950s means I had around one thousand ancestors living around 1650 (under Louis XIV). Three centuries and ten generations earlier, around 1350 (under Jean II le Bon), it was one million, and the same stubborn arithmetic leads to one billion ancestors around 1050 (under Henri Ier). As in the famous wheat and chessboard problem, the exponential law makes figures explode beyond control at some point. Except that no one can have as many as one billion ancestors in 1050, because the entire world population at that time was less than half this figure. The curve of my theoretical number of ancestors crosses the curve of world population somewhere around the beginning of the 12th century.
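The doubling described above is easy to check with a few lines of Python. This is only a sketch: the function names are made up for the illustration, the thirty-year generation is the rough figure used in the text, and the ceiling of 500 million is an assumed, generous upper bound standing in for "less than half a billion", not a measured population.

```python
# Sketch of the doubling arithmetic; all names and defaults are illustrative.

def ancestor_slots(generation: int) -> int:
    """Theoretical number of ancestors n generations back: 2^n."""
    return 2 ** generation


def year_of_generation(birth_year: int, generation: int,
                       years_per_generation: int = 30) -> int:
    """Approximate year in which the ancestors of a given generation lived."""
    return birth_year - generation * years_per_generation


def first_generation_exceeding(ceiling: int) -> int:
    """Smallest n for which 2^n is strictly larger than the given ceiling."""
    generation = 0
    while ancestor_slots(generation) <= ceiling:
        generation += 1
    return generation


if __name__ == "__main__":
    birth_year = 1950  # "born in the 1950s", as in the text

    # Generations 10, 20 and 30: about a thousand, a million, a billion.
    for n in (10, 20, 30):
        print(n, year_of_generation(birth_year, n), ancestor_slots(n))

    # Assumption: 500 million as an upper bound on the world population.
    n = first_generation_exceeding(500_000_000)
    print(n, year_of_generation(birth_year, n))
```

With these assumptions the doubling overtakes the ceiling at generation 29, around 1080, in the same neighbourhood as the crossing placed above near the start of the 12th century. A different generation length or population estimate shifts the date by a few decades.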
What does that mean? Any of my ancestors before 1200 is likely to be my ancestor by many different paths, and is probably your ancestor as well. People who have traced their genealogy that far back know they are indeed all cousins. And all of royal descent, of course, since along those millions of different paths it's highly probable to find a king or queen. But whether you know it or not, the figures are relentless. You who read these lines are very probably my cousin, but we will probably never know precisely to what degree, nor the name and era of our last common ancestor. This is both a fascinating and frustrating conclusion.

2018-09-05

Half the sky of Wikidata

I thought I was done with this blog, where I had not published for almost two years, but I've come back to linked data lately, through a grandfather's interest in genealogy. For what is genealogy, if not the ancestor of linked data science? Genealogical trees are maybe the first type of semantic graph ever invented: entities (persons) linked to each other by predicates such as has father, has mother, has child, has sibling, married to, and linked to places (of birth, of death, of marriage), points in time (dates of birth, marriage, death), occupations, works etc. One could think that genealogical data would be the first candidate to be exposed as linked open data. But far from it. Most genealogical data is locked in proprietary databases, and exchanged in formats far from the semantic web standards. The largest of those databases, such as MyHeritage, hold billions of records.

In the linked data world, Person is indeed the most represented type of thing, but the figures are three orders of magnitude below those of the giant genealogical data silos mentioned above. As I write, Wikidata contains over 4,500,000 people. The current exact value can be retrieved from this query thanks to the excellent Wikidata SPARQL interface. That other query retrieves the current number of women (declared of gender female), a little more than 700,000. A similar one yields the number of those declared as male, more than 3,000,000. That leaves a number of people whose gender is neither male nor female, or is not specified in the database, similar to the number of women.
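For readers who want to reproduce such counts, here is a minimal Python sketch of what queries of this kind might look like. It is not the original queries, only an illustration built on the usual Wikidata identifiers: P31 (instance of), Q5 (human), P21 (sex or gender), Q6581072 (female) and Q6581097 (male). The helper names and the User-Agent string are invented for the example, and counting over all humans may well time out on the public endpoint.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint; the wd: and wdt: prefixes are predefined there.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Wikidata items for the two gender values discussed in the text.
GENDER_ITEMS = {"female": "Q6581072", "male": "Q6581097"}


def count_humans_query(gender=None):
    """Build a SPARQL query counting humans, optionally restricted to one gender."""
    lines = [
        "SELECT (COUNT(?person) AS ?count) WHERE {",
        "  ?person wdt:P31 wd:Q5 .",  # instance of: human
    ]
    if gender is not None:
        lines.append(f"  ?person wdt:P21 wd:{GENDER_ITEMS[gender]} .")
    lines.append("}")
    return "\n".join(lines)


def run_count(query):
    """Send the query to the endpoint and return the count as an int (needs network)."""
    url = WIKIDATA_SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # Illustrative User-Agent; replace it with one identifying your own script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "gender-count-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=90) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["count"]["value"])


if __name__ == "__main__":
    # Only prints the query text; call run_count(...) to actually execute it.
    print(count_humans_query("female"))
```

The figures returned today will differ from those quoted above, since Wikidata keeps growing.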

Let's not nitpick on numbers, and face the obvious fact that Wikidata has a strong gender bias. Fewer than one person in five in Wikidata is a woman. This is of course not a deliberate Wikidata policy, but a mirror of how the notability process works at large in our world, not only in Wikipedia (the main source of Wikidata) but also in other data sources such as library authorities. If one applies to the previous queries a supplementary filter such as people with an ISNI or VIAF identifier, the proportion stays about the same. Is this changing with time? Maybe men were more notable in earlier times, and the results are more balanced nowadays. Barely. More than half of the people identified in Wikidata were born after 1900, and filtering the above queries to select only people under 50 (born since 1968), one finds about 200,000 women for 550,000 men. The proportion of women rises to slightly over 25%. A little better, but no big deal. Not half the sky, yet.
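The proportions can be checked from the rounded figures quoted in these paragraphs. In this small sketch, the second ratio assumes that women and men together make up the whole under-50 group, which ignores people of other or unspecified gender.

```python
def share(part: int, whole: int) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


# Rounded figures quoted in the text, not live Wikidata counts.
women_overall = share(700_000, 4_500_000)
# Assumption: the under-50 group is taken as women + men only.
women_under_50 = share(200_000, 200_000 + 550_000)

print(f"{women_overall:.1%}")   # 15.6%
print(f"{women_under_50:.1%}")  # 26.7%
```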



Many women can certainly be added to Wikidata without straining the notability policy too much. Reading through many Wikipedia articles about so-called notable people (of either gender), one can notice that the women linked to them are often mentioned and named as mother, spouse, daughter or sister, with elements of description such as birth and death dates, and more. But those women have not yet been considered notable enough to be the subject of a separate entry in Wikipedia, and therefore have not been entered in Wikidata, although they would often provide a missing genealogical link between existing elements.

What about the figures for genealogical relationships? Since they are the most ancient and obvious way of linking people, one would think they are very common in Wikidata. Far from it. Less than 10% of all people are linked, as either subject or object, by a kinship predicate (child, mother, father, sibling, spouse). And focusing on gender again, one can find fewer than 15,000 mother-daughter links (declared both ways) versus more than 90,000 father-son links. The gender bias shown by the number of relationships is even more obvious than the one shown by the number of entities.
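One possible way to count such links is sketched below in Python. It is a guess at the shape of the query, not the one actually used for the figures above: "declared both ways" is read here as the child pointing to the parent (P25 mother or P22 father) and the parent pointing back to the child (P40 child), with the child's gender given by P21. The function name is hypothetical.

```python
# Wikidata properties and items used; the mapping names are illustrative.
PARENT_PROPERTY = {"mother": "P25", "father": "P22"}
GENDER_ITEM = {"female": "Q6581072", "male": "Q6581097"}


def parent_child_links_query(parent_role: str, child_gender: str) -> str:
    """Build a SPARQL query counting parent-child links declared in both directions."""
    return "\n".join([
        "SELECT (COUNT(*) AS ?count) WHERE {",
        f"  ?child wdt:{PARENT_PROPERTY[parent_role]} ?parent .",  # child -> parent
        "  ?parent wdt:P40 ?child .",                              # parent -> child
        f"  ?child wdt:P21 wd:{GENDER_ITEM[child_gender]} .",      # gender of the child
        "}",
    ])


if __name__ == "__main__":
    print(parent_child_links_query("mother", "female"))  # mother-daughter links
    print(parent_child_links_query("father", "male"))    # father-son links
```

The resulting text can be pasted into the Wikidata SPARQL interface, where the wd: and wdt: prefixes are predefined.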

Things can be done to improve such a situation, using the many existing tools to query Wikidata and report anomalies. One example is this missing parent report, listing individuals linked directly to a grandparent without being linked to the in-between parent. In many cases, the missing link can be identified and added to the database. Anomaly reports exist for each kinship relationship. I've started to work on this, one woman at a time. Half the sky is far away, but I'll do my part.
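The logic of such a missing parent check can be illustrated on a toy, in-memory graph. This Python sketch is not how the report itself is built, which is not described here. The two dictionaries, the function name and the placeholder identifiers are all hypothetical, and the sketch simply assumes that direct grandparent links are available in some form.

```python
def missing_parent_links(parents, grandparents):
    """Return (person, grandparent) pairs with no recorded in-between parent.

    parents:      maps a person to the set of their recorded parents
    grandparents: maps a person to the set of directly declared grandparents
    """
    gaps = []
    for person, declared in grandparents.items():
        for grandparent in sorted(declared):
            # The link is bridged if some recorded parent of the person
            # has this grandparent among their own recorded parents.
            bridged = any(
                grandparent in parents.get(parent, set())
                for parent in parents.get(person, set())
            )
            if not bridged:
                gaps.append((person, grandparent))
    return gaps


# Toy data with placeholder identifiers (no real people).
parents = {
    "child_a": {"parent_a"},
    "parent_a": {"grandparent_a"},
}
grandparents = {
    "child_a": {"grandparent_a"},  # bridged through parent_a
    "child_b": {"grandparent_b"},  # no in-between parent recorded
}

print(missing_parent_links(parents, grandparents))
# [('child_b', 'grandparent_b')]
```

Each pair reported this way is a candidate for manual research: the in-between person may be named in an article without yet having an entry of their own.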

A more detailed introduction to genealogy and linked data, with examples, is available here (in French).