Wikidata - get labels for a large number of ids

Question

I have a list of around 300,000 Wikidata ids (e.g. Q1347065, Q731635) in an ndjson file, formatted as

{"Q1347065": ""}
{"Q731635": ""}
{"Q191789": ""} ... etc

What I would like is to get the label of each id, and form a dictionary of key values, such as

{"Q1347065":"epiglottitis", "Q731635":"Mount Vernon", ...} etc.

Before the list of ids got so large, I used a Wikidata Python library (https://pypi.org/project/Wikidata/):

import json

import ndjson
from wikidata.client import Client

client = Client()
with open("claims.ndjson") as f, open('claims_to_strings.json', 'w') as out:
    claims = ndjson.load(f)

    l = {} 
    for d in claims: 
        l.update(d)

    for key in l:
        v = client.get(key)
        l[key] = str(v.label)

    json.dumps(l, out)

But it is too slow (around 15 hours for 1000 ids). Is there another way to achieve this that is faster than what I have been doing?

How did you get the list of ids? Suppose you have a list of strings ("java", "python", etc.): how do you get their ids automatically? — LearnToGrow, Oct 30 '22 at 20:09

score 2 · Accepted Answer · answered Feb 16 '21 at 11:06

Before answering: I'm not sure what you meant by json.dumps(r, out); I'm assuming you wanted json.dump(l, out) instead.
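
To illustrate the difference between the two functions, here is a minimal sketch. The io.StringIO buffer is only a stand-in for the output file, and the one-entry dictionary reuses an id and label from the question.

```python
import io
import json

labels = {"Q1347065": "epiglottitis"}

# json.dumps returns the JSON text as a str; it takes no file argument.
text = json.dumps(labels)

# json.dump writes the JSON text to a file-like object and returns None.
buffer = io.StringIO()
json.dump(labels, buffer)

# Both paths produce the same JSON text.
print(text == buffer.getvalue())
```

Passing the file as a second positional argument to json.dumps raises a TypeError, because every parameter after the object is keyword-only.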

My answer is to send the following SPARQL query to the Wikidata Query Service:

SELECT ?item ?itemLabel WHERE {
  VALUES ?item { wd:Q1347065 wd:Q731635 wd:Q105492052 }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

This asks for multiple labels in a single request.

This greatly reduces your execution time: your bottleneck is the number of connections, and with this method the id -> label mapping is done entirely on the server side.

import json
import ndjson
import re
import requests

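def wikidata_query(query):
    # Send one SPARQL query to the Wikidata Query Service and return the result rows.
    url = 'https://query.wikidata.org/sparql'
    try:
        r = requests.get(url, params={'format': 'json', 'query': query})
        return r.json()['results']['bindings']
    except json.JSONDecodeError as e:
        # A body that is not valid JSON usually means the service rejected the query.
        raise Exception('Invalid query') from e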

with open("claims.ndjson") as f, open('claims_to_strings.json', 'w') as out:
    claims = ndjson.load(f)

    l = {} 
    for d in claims: 
        l.update(d)
    
    item_ids = l.keys()
    # Prefix every id with "wd:" so it can be used in the VALUES clause.
    sparql_values = ["wd:" + item_id for item_id in item_ids]
    item2label = wikidata_query('''
        SELECT ?item ?itemLabel WHERE {
            VALUES ?item { %s }
            SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }''' % " ".join(sparql_values))

    for result in item2label:
        # Strip the entity URI prefix, keeping only the id (e.g. Q1347065).
        item = re.sub(r".*[#/\\]", "", result['item']['value'])
        label = result['itemLabel']['value']
        l[item] = label
    
    json.dump(l, out)

You probably cannot fetch all 300,000 items in a single query, but you can find the largest number of ids the service accepts per request and split your original id list into batches of that size.
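
The splitting could be sketched roughly as follows. This is an untested-against-the-live-service sketch under stated assumptions: chunked, build_query and labels_in_batches are hypothetical helper names, the batch size of 200 and the one-second pause are arbitrary placeholders rather than documented limits, and run_query stands for any function that takes a SPARQL string and returns the result bindings, such as wikidata_query above.

```python
import time
from itertools import islice


def chunked(ids, size):
    """Yield successive lists of at most `size` ids."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_query(batch, language="en"):
    """Build one label query for a batch of ids such as 'Q1347065'."""
    values = " ".join("wd:" + qid for qid in batch)
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        "  VALUES ?item { %s }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "%s". }\n'
        "}" % (values, language)
    )


def labels_in_batches(ids, run_query, batch_size=200, pause=1.0):
    """Query labels batch by batch and merge them into one dict.

    batch_size and pause are assumed values; tune them to what the
    service actually accepts.
    """
    labels = {}
    for batch in chunked(ids, batch_size):
        for row in run_query(build_query(batch)):
            # Keep only the id at the end of the entity URI.
            qid = row["item"]["value"].rsplit("/", 1)[-1]
            labels[qid] = row["itemLabel"]["value"]
        time.sleep(pause)  # be gentle with the public endpoint
    return labels
```

With the function from this answer, the call would look like labels_in_batches(l.keys(), wikidata_query). Ids for which the service returns no row are simply absent from the returned dictionary.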

Yes I meant l instead of r, thanks for catching that, and thanks for the suggestion! I'll try it out now! — Paschalis, Feb 16 '21 at 14:36
@Paschalis OK, but also check the difference between the `dumps` and `dump` methods! — logi-kal, Feb 16 '21 at 14:39
