
I would like to extract the distinct names of all persons, i.e. named entities that are human, from Wikidata with Python. I have tried different libraries (qwikidata, mwikidata), different GET requests, and Wikidata's SPARQL service itself. After a while I understood that a general query like this:

SELECT ?person ?personLabel

WHERE {
    ?person wdt:P31 wd:Q5 .
    ?person rdfs:label ?personLabel. FILTER( LANG(?personLabel)="de, en" )
}

is too large for the public API. I then added a combination of LIMIT and OFFSET at the end of the query, e.g.:

ORDER BY ASC(?personLabel)

LIMIT 10000 OFFSET 10000

But no matter what I try, I get either a TimeOutError (Wikidata service) or json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) (Python).

One idea is to generate multiple datasets by splitting on the sex or gender property (P21), but for male and female the same problem persists.

Help is much appreciated!

joey11235
  • Pagination in SPARQL is slow: not only is `order by` slow, `offset` is also more complicated than in SQL. There are `10 016 353` persons in Wikidata (and that's just the direct assertions), so you won't make it via the public SPARQL endpoint. It is a shared service. I'd load the data into your own local triple store, or just use command-line tools like `awk` and `sed`. – UninformedUser Jul 15 '22 at 17:10
  • The alternative would be to use the QLever endpoint, which is way faster than the Blazegraph backend of the public Wikidata endpoint: https://qlever.cs.uni-freiburg.de/wikidata – UninformedUser Jul 15 '22 at 17:11
  • By the way, that filter expression is wrong: `FILTER( LANG(?personLabel)="en, de" )`. It does not expect a comma-separated list of language tags; no idea where you got this from. If you want multiple conditions combined with a logical OR, you have to use `||` and repeat the expression: `FILTER(expr1 || expr2)` – UninformedUser Jul 15 '22 at 17:14
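
The `||` form from the last comment can be sketched in Python as follows. This is only an illustration, not a tested solution for the full dataset: the helper names `build_person_query` and `parse_labels` are made up for this example, and the code only builds the query string and parses a SPARQL JSON result; it sends no request. It also assumes that the `JSONDecodeError ... (char 0)` in the question arises because the response body was not JSON at all (for instance empty, or an error page after a timeout), so that case is turned into a clearer error.

```python
import json


def build_person_query(langs=("de", "en"), limit=None, offset=0):
    """Build a SPARQL query for labels of humans (wd:Q5) in the given languages.

    Hypothetical helper: only the triple patterns come from the question.
    """
    # One LANG(...) comparison per language tag, joined with a logical OR,
    # instead of a single comma-separated string that never matches.
    lang_filter = " || ".join(f'LANG(?personLabel) = "{lang}"' for lang in langs)
    lines = [
        "SELECT ?person ?personLabel",
        "WHERE {",
        "    ?person wdt:P31 wd:Q5 .",
        "    ?person rdfs:label ?personLabel .",
        f"    FILTER({lang_filter})",
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {limit} OFFSET {offset}")
    return "\n".join(lines)


def parse_labels(response_text):
    """Return the sorted distinct labels from a SPARQL JSON result body."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError as exc:
        # Assumption: a non-JSON body means the endpoint returned an error
        # or nothing at all, so report the start of the body.
        raise RuntimeError(
            f"endpoint did not return JSON: {response_text[:80]!r}"
        ) from exc
    bindings = data["results"]["bindings"]
    return sorted({row["personLabel"]["value"] for row in bindings})
```

Calling `build_person_query(limit=10000, offset=10000)` yields a query string that could be sent to whichever endpoint is used. As the first comment notes, paging through roughly ten million persons this way is still unlikely to finish on the public endpoint.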
