2018-09-05

Half the sky of Wikidata

I thought I was done with this blog, where I had not published for almost two years, but I've come back to linked data lately, through a grandfather's interest in genealogy. For what is genealogy, if not the ancestor of linked data science? Genealogical trees are maybe the first type of semantic graph ever invented: entities (persons) linked to each other by predicates such as has father, has mother, has child, has sibling, married to, and linked to places (of birth, of death, of marriage), points in time (dates of birth, marriage, death), occupations, works etc. One could think that genealogical data would be the first candidate to be exposed as linked open data. Far from it. Most genealogical data is locked in proprietary databases, and exchanged in formats far from the semantic web standards. The largest of those databases, such as MyHeritage, hold billions of records.

In the linked data world, Person is indeed the most represented type of thing, but the figures are three orders of magnitude below those of the giant genealogical data silos mentioned above. As I write, Wikidata contains over 4,500,000 people. The current exact value can be retrieved from this query thanks to the excellent Wikidata SPARQL interface. That other query retrieves the current number of women (declared of gender female), a little more than 700,000. A similar one yields the number of those declared as male, more than 3,000,000. That leaves a number of people whose gender is neither male nor female, or not specified in the database, similar to the number of women.
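For readers who want to rebuild such counts themselves, here is a minimal sketch of how the queries could be written. The identifiers are not given in the text above and are my reading of the Wikidata vocabulary: P31 (instance of), Q5 (human), P21 (sex or gender), Q6581072 (female), Q6581097 (male). The `run_count` helper and its User-Agent string are illustrative, and on today's data set a count this large may well hit the endpoint's timeout.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint; the wd: and wdt: prefixes are predefined there.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata identifiers (not quoted in the post itself).
FEMALE = "Q6581072"
MALE = "Q6581097"


def count_humans_query(gender_qid=None):
    """Build a SPARQL query counting instances of human (Q5),
    optionally restricted to one declared gender (P21)."""
    lines = [
        "SELECT (COUNT(*) AS ?count) WHERE {",
        "  ?person wdt:P31 wd:Q5 .",
    ]
    if gender_qid is not None:
        lines.append(f"  ?person wdt:P21 wd:{gender_qid} .")
    lines.append("}")
    return "\n".join(lines)


def run_count(query):
    """Send the query to the endpoint and return the count as an int.
    Needs network access; the User-Agent value is a placeholder."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "half-the-sky-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["count"]["value"])


if __name__ == "__main__":
    # Print the three queries; call run_count(...) on each to get live figures.
    for label, qid in [("all", None), ("women", FEMALE), ("men", MALE)]:
        print(f"-- {label}")
        print(count_humans_query(qid))
```

Subtracting the women and men counts from the total gives the remaining group, people whose gender is something else or simply not stated.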

Let's not nitpick over numbers, and face the obvious fact that Wikidata has a strong gender bias. Fewer than one person in five in Wikidata is a woman. This is of course not a deliberate Wikidata policy, but a mirror of how the notability process works at large in our world, not only in Wikipedia (the main source of Wikidata) but also in other data sources such as library authorities. If one adds to the previous queries a supplementary filter, such as people with an ISNI or VIAF identifier, the proportion stays about the same. Is this changing with time? Maybe men were more notable in earlier times, and the results are more balanced nowadays. Barely. More than half of the people identified in Wikidata were born after 1900, and filtering the above queries to select only people under 50 (born since 1968), one finds about 200,000 women for 550,000 men. The ratio has risen to slightly over 25%. A little better, but no big deal. Not half the sky, yet.
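The arithmetic behind these ratios is easy to check. The sketch below uses the rounded figures quoted above and computes the share of women among people declared either female or male; it also shows one way the birth-date restriction could be written, assuming P569 is the date-of-birth property (an identifier not quoted in the text).

```python
def share_of_women(women, men):
    """Fraction of women among people declared either female or male."""
    return women / (women + men)


def born_since_clause(year):
    """SPARQL lines restricting ?person to a date of birth (P569,
    an assumed identifier) in or after the given year."""
    return (
        "  ?person wdt:P569 ?birth .\n"
        f"  FILTER(YEAR(?birth) >= {year})"
    )


# Rounded figures from the post, not live data.
overall = share_of_women(700_000, 3_000_000)
under_50 = share_of_women(200_000, 550_000)

print(f"overall: {overall:.1%}, born since 1968: {under_50:.1%}")
print(born_since_clause(1968))
```

With these rounded inputs the share comes out at roughly 19% overall and roughly 27% for people born since 1968, which is the "slightly over 25%" mentioned above.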



Many women could certainly be added to Wikidata without stretching the notability policy too far. Reading through many Wikipedia articles about so-called notable people (of either gender), one notices that the women linked to them are often mentioned and named as mother, spouse, daughter or sister, with elements of description such as birth and death dates, and more. But those women have not yet been considered notable enough to be the subject of a separate entry in Wikipedia, and therefore have not been entered in Wikidata, although they would often provide a missing genealogical link between existing elements.

What about the figures for genealogical relationships? Since they are the most ancient and obvious way of linking people, one would think they are very common in Wikidata. Far from it. Less than 10% of all people are linked, as either subject or object, by a kinship predicate (child, mother, father, sibling, spouse). And focusing on gender again, one finds fewer than 15,000 mother-daughter links (declared both ways) versus more than 90,000 father-son links. The gender bias shown by the number of relationships is even more obvious than the one shown by the number of entities.
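As an illustration, here is one way such link counts could be expressed. It is a sketch under assumptions: the property identifiers P25 (mother), P22 (father), P40 (child) and P21 (sex or gender) are my reading of the Wikidata vocabulary, and "declared both ways" is interpreted here as requiring both the child-to-parent and the parent-to-child statements, which may not be exactly how the figures above were obtained.

```python
# Assumed Wikidata identifiers (not quoted in the post itself).
FEMALE = "Q6581072"
MALE = "Q6581097"
MOTHER = "P25"
FATHER = "P22"
CHILD = "P40"


def reciprocal_links_query(parent_property, child_gender):
    """Build a SPARQL query counting parent-child pairs stated in both
    directions, where the child has the given declared gender."""
    return "\n".join([
        "SELECT (COUNT(*) AS ?count) WHERE {",
        f"  ?child wdt:{parent_property} ?parent .",  # child -> parent
        f"  ?parent wdt:{CHILD} ?child .",            # parent -> child
        f"  ?child wdt:P21 wd:{child_gender} .",      # gender of the child
        "}",
    ])


mother_daughter = reciprocal_links_query(MOTHER, FEMALE)
father_son = reciprocal_links_query(FATHER, MALE)

print(mother_daughter)
print(father_son)
```

The gender of the parent is not tested separately, since the mother and father properties already carry it. Dropping either of the first two triple patterns would count links declared in one direction only.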

Things can be done to improve this situation, using the many existing tools to query Wikidata and report anomalies. For example, this missing parent report lists individuals linked directly to a grandparent without being linked to the in-between parent. In many cases, the missing link can be identified and added to the database. Anomaly reports exist for each kinship relationship. I've started to work on this, one woman at a time. Half the sky is far away, but I'll do my part.
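The logic of a missing parent check can be sketched on a toy in-memory graph. This is not how the Wikidata report itself is implemented, which I have not inspected; the function name, the dictionaries and the sample names are all hypothetical, and only illustrate the idea of a grandparent link with no parent in between.

```python
def missing_parent_links(grandparents_of, parents_of):
    """Return sorted (person, grandparent) pairs for which no recorded
    parent of the person is recorded as a child of that grandparent."""
    missing = []
    for person, grandparents in grandparents_of.items():
        for grandparent in grandparents:
            bridged = any(
                grandparent in parents_of.get(parent, set())
                for parent in parents_of.get(person, set())
            )
            if not bridged:
                missing.append((person, grandparent))
    return sorted(missing)


# Hypothetical toy data: each key maps to the set of people recorded
# as its parents (parents_of) or grandparents (grandparents_of).
parents_of = {
    "Claire": {"Paule"},
    "Paule": {"Germaine"},
}
grandparents_of = {
    "Claire": {"Germaine"},  # bridged through Paule
    "Denise": {"Germaine"},  # no in-between parent recorded
}

print(missing_parent_links(grandparents_of, parents_of))
```

Each pair in the result is a candidate for research: find the in-between parent, check the sources, and add the missing item and statements.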

A more detailed introduction to genealogy and linked data, with examples, is available here (in French).