21

How can I get a Wikipedia page (in a particular language, say French) from a Wikidata ID (e.g. Q19675)? The question seems obvious, but strangely I find nothing on the web. I'm looking for a URL that I could use with the requests Python module, something like:

url = "https://www.wikidata.org/w/api.php?action=some_method&ids=Q19675"
r = requests.post(url, headers={"User-Agent" : "Magic Browser"})

Can someone help me?

Stanislav Kralin
Patrick

4 Answers

22

You have to use the MediaWiki API with action=wbgetentities:

https://www.wikidata.org/w/api.php?action=wbgetentities&format=xml&props=sitelinks&ids=Q19675&sitefilter=frwiki

where:

  • ids=Q19675 – the Wikidata item ID
  • sitefilter=frwiki – return the page title only for the French Wikipedia

For your example, the response will be:

<api success="1">
    <entities>
        <entity type="item" id="Q19675">
            <sitelinks>
                <sitelink site="frwiki" title="Musée du Louvre">
                    <badges/>
                </sitelink>
            </sitelinks>
        </entity>
    </entities>
</api>
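From Python with requests, the same call might look like the sketch below (it assumes format=json instead of the XML above, because JSON is easier to parse; the helper names are ours, not part of the API):

```python
def sitelink_title(response_json, qid, site="frwiki"):
    """Extract the page title for one wiki from a wbgetentities JSON response."""
    return response_json["entities"][qid]["sitelinks"][site]["title"]

def get_wikipedia_title(qid, site="frwiki"):
    """Ask Wikidata for the sitelink title of item `qid` on the given wiki."""
    import requests
    params = {
        "action": "wbgetentities",
        "format": "json",        # JSON instead of XML, for easier parsing
        "props": "sitelinks",
        "ids": qid,
        "sitefilter": site,
    }
    r = requests.get("https://www.wikidata.org/w/api.php", params=params,
                     headers={"User-Agent": "Magic Browser"})
    r.raise_for_status()
    return sitelink_title(r.json(), qid, site)
```

get_wikipedia_title("Q19675") should then return the same title as the XML response above, "Musée du Louvre".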
Termininja
  • Thank you for your answer Termininja. So am I correct to say that the URL can always be found at fr.wikipedia.org/wiki/{title with spaces replaced by underscores}? – Patrick May 09 '16 at 14:05
  • If you set props to be sitelinks/urls you should get the url without having to normalize yourself – Erik May 17 '18 at 19:32
  • Is there a way to get the extract for a particular language without having to follow the URL in a second call? For example, [this](https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5317&languages=en&props=descriptions%7Csitelinks%2Furls) call returns the pages for "Space Needle" Seattle, WA, US. Since I'm specifying the language in the query, I'd expect that they didn't return me a bunch of useless URLs I didn't ask for, and resolve the [English one](https://en.wikipedia.org/wiki/Space_Needle) directly. But I can't seem to find a way to do that. – Abhijit Sarkar Jun 19 '20 at 22:11
  • @AbhijitSarkar, are you looking for this: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5317&languages=en&props=descriptions|sitelinks%2Furls&sitefilter=enwiki – Termininja Jun 21 '20 at 08:12
  • @Termininja Sadly, no. You filtered by language, but the extract isn’t there. https://en.m.wikipedia.org/wiki/Space_Needle. Everything you see until Architecture is the extract that you will find if you follow a particular site link. – Abhijit Sarkar Jun 21 '20 at 08:17
  • It is not possible to get Wikipedia content from Wikidata API – Termininja Jun 21 '20 at 09:37
  • Use props=sitelinks/urls instead of props=sitelinks, and in case you don't want to mess with languages, you can always use simplewiki – Leandro Bardelli Jan 01 '21 at 22:26
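As the last comments note, the article text itself is not served by the Wikidata API; it takes a second call to the Wikipedia API of the target language, typically with prop=extracts (provided by the TextExtracts extension). A sketch of that second call, with helper names that are ours:

```python
def extract_params(title, intro_only=True):
    """Build the query parameters for action=query&prop=extracts."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "explaintext": 1,   # plain text instead of HTML
        "format": "json",
    }
    if intro_only:
        params["exintro"] = 1   # only the lead section, i.e. the "extract"
    return params

def get_extract(title, lang="en"):
    """Fetch the extract of one page from the given language's Wikipedia."""
    import requests
    r = requests.get(f"https://{lang}.wikipedia.org/w/api.php",
                     params=extract_params(title))
    r.raise_for_status()
    pages = r.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract")
```

get_extract("Space Needle") would return the lead section of the English article, i.e. the text the commenter was after, at the cost of the second round trip.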
4

In Python you could do something like this:

import requests

def get_wikipedia_url_from_wikidata_id(wikidata_id, lang='en', debug=False):
    url = (
        'https://www.wikidata.org/w/api.php'
        '?action=wbgetentities'
        '&props=sitelinks/urls'
        f'&ids={wikidata_id}'
        '&format=json')
    json_response = requests.get(url).json()
    if debug:
        print(wikidata_id, url, json_response)

    entities = json_response.get('entities')
    if entities:
        entity = entities.get(wikidata_id)
        if entity:
            sitelinks = entity.get('sitelinks')
            if sitelinks:
                if lang:
                    # filter only the specified language
                    sitelink = sitelinks.get(f'{lang}wiki')
                    if sitelink:
                        wiki_url = sitelink.get('url')
                        if wiki_url:
                            return requests.utils.unquote(wiki_url)
                else:
                    # return all of the urls
                    wiki_urls = {}
                    for key, sitelink in sitelinks.items():
                        wiki_url = sitelink.get('url')
                        if wiki_url:
                            wiki_urls[key] = requests.utils.unquote(wiki_url)
                    return wiki_urls
    return None

If you run get_wikipedia_url_from_wikidata_id("Q182609", lang='en') you get back: 'https://en.wikipedia.org/wiki/Jacques_Rogge'

If you run get_wikipedia_url_from_wikidata_id("Q182609", lang=None) you get back:

{'afwiki': 'https://af.wikipedia.org/wiki/Jacques_Rogge',
 'alswiki': 'https://als.wikipedia.org/wiki/Jacques_Rogge',
 'arwiki': 'https://ar.wikipedia.org/wiki/جاك_روج',
 'azbwiki': 'https://azb.wikipedia.org/wiki/ژاک_روق',
 'azwiki': 'https://az.wikipedia.org/wiki/Jak_Roqqe',
 'bclwiki': 'https://bcl.wikipedia.org/wiki/Jacques_Rogge',
 'bgwiki': 'https://bg.wikipedia.org/wiki/Жак_Рох',
 'bnwiki': 'https://bn.wikipedia.org/wiki/জ্যাকুয়েস_রগ',
 'cawiki': 'https://ca.wikipedia.org/wiki/Jacques_Rogge',
 'commonswiki': 'https://commons.wikimedia.org/wiki/Category:Jacques_Rogge',
 'cswiki': 'https://cs.wikipedia.org/wiki/Jacques_Rogge',
 'cswikiquote': 'https://cs.wikiquote.org/wiki/Jacques_Rogge',
 'cywiki': 'https://cy.wikipedia.org/wiki/Jacques_Rogge',
 'dawiki': 'https://da.wikipedia.org/wiki/Jacques_Rogge',
 'dewiki': 'https://de.wikipedia.org/wiki/Jacques_Rogge',
 'elwiki': 'https://el.wikipedia.org/wiki/Ζακ_Ρογκ',
 'enwiki': 'https://en.wikipedia.org/wiki/Jacques_Rogge',
 'eowiki': 'https://eo.wikipedia.org/wiki/Jacques_Rogge',
 'eswiki': 'https://es.wikipedia.org/wiki/Jacques_Rogge',
 'etwiki': 'https://et.wikipedia.org/wiki/Jacques_Rogge',
 'fawiki': 'https://fa.wikipedia.org/wiki/ژاک_روگ',
 'fiwiki': 'https://fi.wikipedia.org/wiki/Jacques_Rogge',
 'frwiki': 'https://fr.wikipedia.org/wiki/Jacques_Rogge',
 'hewiki': "https://he.wikipedia.org/wiki/ז'אק_רוג",
 'hrwiki': 'https://hr.wikipedia.org/wiki/Jacques_Rogge',
 'huwiki': 'https://hu.wikipedia.org/wiki/Jacques_Rogge',
 'idwiki': 'https://id.wikipedia.org/wiki/Jacques_Rogge',
 'itwiki': 'https://it.wikipedia.org/wiki/Jacques_Rogge',
 'itwikiquote': 'https://it.wikiquote.org/wiki/Jacques_Rogge',
 'jawiki': 'https://ja.wikipedia.org/wiki/ジャック・ロゲ',
 'kkwiki': 'https://kk.wikipedia.org/wiki/Жак_Рогге',
 'kowiki': 'https://ko.wikipedia.org/wiki/자크_로게',
 'ltwiki': 'https://lt.wikipedia.org/wiki/Jacques_Rogge',
 'lvwiki': 'https://lv.wikipedia.org/wiki/Žaks_Roge',
 'mkwiki': 'https://mk.wikipedia.org/wiki/Жак_Рог',
 'mnwiki': 'https://mn.wikipedia.org/wiki/Жак_Рогге',
 'mswiki': 'https://ms.wikipedia.org/wiki/Jacques_Rogge',
 'nlwiki': 'https://nl.wikipedia.org/wiki/Jacques_Rogge',
 'nowiki': 'https://no.wikipedia.org/wiki/Jacques_Rogge',
 'plwiki': 'https://pl.wikipedia.org/wiki/Jacques_Rogge',
 'ptwiki': 'https://pt.wikipedia.org/wiki/Jacques_Rogge',
 'rowiki': 'https://ro.wikipedia.org/wiki/Jacques_Rogge',
 'ruwiki': 'https://ru.wikipedia.org/wiki/Рогге,_Жак',
 'ruwikinews': 'https://ru.wikinews.org/wiki/Категория:Жак_Рогге',
 'scowiki': 'https://sco.wikipedia.org/wiki/Jacques_Rogge',
 'simplewiki': 'https://simple.wikipedia.org/wiki/Jacques_Rogge',
 'skwiki': 'https://sk.wikipedia.org/wiki/Jacques_Rogge',
 'srwiki': 'https://sr.wikipedia.org/wiki/Жак_Рог',
 'svwiki': 'https://sv.wikipedia.org/wiki/Jacques_Rogge',
 'thwiki': 'https://th.wikipedia.org/wiki/ฌัก_โรคเคอ',
 'tlwiki': 'https://tl.wikipedia.org/wiki/Jacques_Rogge',
 'trwiki': 'https://tr.wikipedia.org/wiki/Jacques_Rogge',
 'ukwiki': 'https://uk.wikipedia.org/wiki/Жак_Рогге',
 'viwiki': 'https://vi.wikipedia.org/wiki/Jacques_Rogge',
 'wuuwiki': 'https://wuu.wikipedia.org/wiki/雅克·罗格',
 'zhwiki': 'https://zh.wikipedia.org/wiki/雅克·罗格'}

Once you have the Wikipedia URLs, you should also resolve any redirects:

def get_redirect_urls(wikipedia_url_ids, retry=0, max_retry=10, backoff=2, debug=False):
    import json
    import requests
    import time

    if retry == max_retry:
        return None

    wikipedia_url_ids_encoded = [requests.utils.quote(x) for x in wikipedia_url_ids]
    request_url = r'https://en.wikipedia.org/w/api.php?action=query&titles={}&&redirects&format=json'.format("|".join(wikipedia_url_ids_encoded))
    if debug:
        print(request_url)
    response = requests.get(request_url)

    if response.status_code == 429:
        # Too many requests: wait with exponential backoff, then retry
        wait = backoff * (2 ** retry)
        time.sleep(wait)
        return get_redirect_urls(wikipedia_url_ids, retry=retry + 1,
                                 max_retry=max_retry, backoff=backoff, debug=debug)
    response_json = json.loads(response.text)
    if debug:
        print(response_json)
    if 'query' not in response_json:
        raise RuntimeError(f'impossible to parse request: {request_url}, response: {response_json}')
    normalized = {x['from']: x['to'] for x in response_json['query'].get('normalized', [])}
    redirects = {x['from']: x['to'] for x in response_json['query'].get('redirects', [])}

    normalized_url_id_dict = {x: normalized.get(x, x) for x in wikipedia_url_ids}

    return {x: redirects[normalized_url_id_dict[x]].replace(' ', '_') for x in wikipedia_url_ids
            if normalized_url_id_dict[x] in redirects}

If you strip the hostname prefix and keep only the page identifier from the URL path, you can pass a list of identifiers to this function and get back a dictionary mapping them to their redirect targets. For example, calling:

get_redirect_urls(["Tom_Cruise", "Tom Cruise", "Mark_Daniel_Gangloff", "Park_Jung-suk_(gamer)",
                   "Military_Academy_Commander_in_Chief_Hugo_Rafael_Chávez_Frías", "Valley_Children’s_Hospital",
                   "Park Jung-suk (video game player)", "Valley Children's Hospital",
                   "Troop Officers Military College",
                   "Hobbs_&_Shaw", "Schematic_Records", "Ethan_Klein", "Hobbs & Shaw"], debug=True)

Would print and return:

https://en.wikipedia.org/w/api.php?action=query&titles=Tom_Cruise|Tom%20Cruise|Mark_Daniel_Gangloff|Park_Jung-suk_%28gamer%29|Military_Academy_Commander_in_Chief_Hugo_Rafael_Ch%C3%A1vez_Fr%C3%ADas|Valley_Children%E2%80%99s_Hospital|Park%20Jung-suk%20%28video%20game%20player%29|Valley%20Children%27s%20Hospital|Troop%20Officers%20Military%20College|Hobbs_%26_Shaw|Schematic_Records|Ethan_Klein|Hobbs%20%26%20Shaw&&redirects&format=json
{'batchcomplete': '', 'query': {'normalized': [{'from': 'Tom_Cruise', 'to': 'Tom Cruise'}, {'from': 'Mark_Daniel_Gangloff', 'to': 'Mark Daniel Gangloff'}, {'from': 'Park_Jung-suk_(gamer)', 'to': 'Park Jung-suk (gamer)'}, {'from': 'Military_Academy_Commander_in_Chief_Hugo_Rafael_Chávez_Frías', 'to': 'Military Academy Commander in Chief Hugo Rafael Chávez Frías'}, {'from': 'Valley_Children’s_Hospital', 'to': 'Valley Children’s Hospital'}, {'from': 'Hobbs_&_Shaw', 'to': 'Hobbs & Shaw'}, {'from': 'Schematic_Records', 'to': 'Schematic Records'}, {'from': 'Ethan_Klein', 'to': 'Ethan Klein'}], 'redirects': [{'from': 'Schematic Records', 'to': 'Intelligent dance music'}, {'from': 'Ethan Klein', 'to': 'H3h3Productions'}, {'from': "Valley Children's Hospital", 'to': 'Valley Children’s Hospital'}, {'from': 'Park Jung-suk (video game player)', 'to': 'Park Jung-suk (gamer)'}, {'from': 'Troop Officers Military College', 'to': 'Military Academy Commander in Chief Hugo Rafael Chávez Frías'}, {'from': 'Mark Daniel Gangloff', 'to': 'Mark Gangloff'}], 'pages': {'55632122': {'pageid': 55632122, 'ns': 0, 'title': 'Hobbs & Shaw'}, '43243423': {'pageid': 43243423, 'ns': 0, 'title': 'Military Academy Commander in Chief Hugo Rafael Chávez Frías'}, '3742745': {'pageid': 3742745, 'ns': 0, 'title': 'Park Jung-suk (gamer)'}, '31460': {'pageid': 31460, 'ns': 0, 'title': 'Tom Cruise'}, '63813213': {'pageid': 63813213, 'ns': 0, 'title': 'Valley Children’s Hospital'}, '49550675': {'pageid': 49550675, 'ns': 0, 'title': 'H3h3Productions'}, '81213': {'pageid': 81213, 'ns': 0, 'title': 'Intelligent dance music'}, '9316111': {'pageid': 9316111, 'ns': 0, 'title': 'Mark Gangloff'}}}}

{'Mark_Daniel_Gangloff': 'Mark_Gangloff',
 'Park Jung-suk (video game player)': 'Park_Jung-suk_(gamer)',
 "Valley Children's Hospital": 'Valley_Children’s_Hospital',
 'Troop Officers Military College': 'Military_Academy_Commander_in_Chief_Hugo_Rafael_Chávez_Frías',
 'Schematic_Records': 'Intelligent_dance_music',
 'Ethan_Klein': 'H3h3Productions'}
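Stripping the hostname prefix as described above is a one-liner with the standard library; a sketch (the helper name is ours):

```python
from urllib.parse import urlsplit, unquote

def title_from_url(wiki_url):
    """Return the page identifier, i.e. the decoded last segment of the path."""
    path = urlsplit(wiki_url).path              # e.g. '/wiki/Jacques_Rogge'
    return unquote(path.rsplit('/', 1)[-1])

title_from_url('https://en.wikipedia.org/wiki/Jacques_Rogge')  # 'Jacques_Rogge'
```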
Gianmario Spacagna
2

First, install the Wikidata library with pip:

https://pypi.org/project/Wikidata/

Now, suppose you have your Wikidata ID (here Q1617977, for example); you can then read entity.data from it.

Just print entity.data and you will see it looks like JSON.

So you can index into it to get other information about your Wikidata ID, doing something like:

['sitelinks']['frwiki']['url']

from wikidata.client import Client

client = Client()
entity = client.get('Q1617977', load=True)

url = entity.data['sitelinks']['frwiki']['url']

print(url)
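Note that not every item has a sitelink for every wiki, so indexing entity.data directly raises KeyError for missing ones; a defensive sketch (the helper name is ours):

```python
def sitelink_url(entity_data, site='frwiki'):
    """Return the sitelink URL for `site`, or None if the item has none."""
    link = entity_data.get('sitelinks', {}).get(site)
    return link.get('url') if link else None
```

Called as sitelink_url(entity.data, 'frwiki'), it replaces the direct indexing above.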
Georgie
  • Thank you for contributing an answer. Would you kindly edit your answer to include an explanation of your code? That will help future readers better understand what is going on, and especially those members of the community who are new to the language and struggling to understand the concepts. That's especially important here when there's already an accepted answer that's been validated by the community. Under what conditions might your approach be preferred? Are you taking advantage of new capabilities? – Jeremy Caney Sep 27 '21 at 00:17
0

The query can also return the data in JSON format, e.g. for Q228865:

https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=sitelinks&ids=Q228865&sitefilter=enwiki

returns:

{"entities":{"Q228865":{"type":"item","id":"Q228865","sitelinks":{"enwiki":{"site":"enwiki","title":"Mia Wasikowska","badges":[]}}}},"success":1}

"title" is then usable for building the URL; spaces are OK:

https://en.wikipedia.org/wiki/Mia Wasikowska

This works too (auto-redirect), or replace the spaces with underscores:

https://en.wikipedia.org/wiki/Mia_Wasikowska
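Building the URL from the title programmatically, with underscores and percent-encoding applied, might look like this (the helper name is ours):

```python
from urllib.parse import quote

def title_to_url(title, lang='en'):
    """Turn a sitelink title into a Wikipedia URL."""
    # MediaWiki canonicalizes spaces to underscores; quote() encodes the rest
    return f"https://{lang}.wikipedia.org/wiki/" + quote(title.replace(' ', '_'))

title_to_url('Mia Wasikowska')  # 'https://en.wikipedia.org/wiki/Mia_Wasikowska'
```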

dragansr