3

I am trying to use the Wikimedia Commons Query Service[1] programmatically using Python, but am having trouble authenticating via OAuth 1.

Below is a self-contained Python example which does not work as expected. The expected behaviour is that a result set is returned, but instead an HTML response of the login page is returned. You can get the dependencies with pip install --user sparqlwrapper oauthlib certifi. The script should then be given the path to a text file containing the pasted output given after applying for an owner-only token[2], e.g.

Consumer token
    deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
Consumer secret
    deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
Access token
    deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef
Access secret
    deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef

[1] https://wcqs-beta.wmflabs.org/ ; https://diff.wikimedia.org/2020/10/29/sparql-in-the-shadow-of-structured-data-on-commons/

[2] https://www.mediawiki.org/wiki/OAuth/Owner-only_consumers

import sys
from SPARQLWrapper import JSON, SPARQLWrapper
import certifi
from SPARQLWrapper import Wrapper
from functools import partial
from oauthlib.oauth1 import Client
 
 
ENDPOINT = "https://wcqs-beta.wmflabs.org/sparql"
QUERY = """
SELECT ?file WHERE {
  ?file wdt:P180 wd:Q42 .
}
"""
 
 
def monkeypatch_sparqlwrapper():
    # Deal with old system certificates
    if not hasattr(Wrapper.urlopener, "monkeypatched"):
        Wrapper.urlopener = partial(Wrapper.urlopener, cafile=certifi.where())
        setattr(Wrapper.urlopener, "monkeypatched", True)
 
 
def oauth_client(auth_file):
    # Read credential from file
    creds = []
    for idx, line in enumerate(auth_file):
        if idx % 2 == 0:
            continue
        creds.append(line.strip())
    return Client(*creds)
 
 
class OAuth1SPARQLWrapper(SPARQLWrapper):
    # OAuth sign SPARQL requests

    def __init__(self, *args, **kwargs):
        self.client = kwargs.pop("client")
        super().__init__(*args, **kwargs)
 
    def _createRequest(self):
        request = super()._createRequest()
        uri = request.get_full_url()
        method = request.get_method()
        body = request.data
        headers = request.headers
        new_uri, new_headers, new_body = self.client.sign(uri, method, body, headers)
        request.full_url = new_uri
        request.headers = new_headers
        request.data = new_body
        print("Sending request")
        print("Url", request.full_url)
        print("Headers", request.headers)
        print("Data", request.data)
        return request
 
 
monkeypatch_sparqlwrapper()
# Close the credentials file once the four values have been read
with open(sys.argv[1]) as auth_file:
    client = oauth_client(auth_file)
sparql = OAuth1SPARQLWrapper(ENDPOINT, client=client)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
 
print("Results")
print(results)
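
The oauth_client function above relies on the file strictly alternating label and value lines, so a stray blank line would shift every credential. A minimal sketch of a label-based alternative follows; the helper name parse_credentials and the LABELS tuple are hypothetical, and the labels are assumed to match the sample file shown earlier. The returned order (consumer key, consumer secret, access token, access secret) matches the positional arguments of oauthlib's Client.

```python
import io

# Labels as they appear in the pasted owner-only token output (assumed
# to match the sample file shown in the question).
LABELS = ("Consumer token", "Consumer secret", "Access token", "Access secret")


def parse_credentials(auth_file):
    """Return the four credentials as a tuple, looked up by label."""
    creds = {}
    label = None
    for raw in auth_file:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line in LABELS:
            label = line  # the next non-blank line holds this label's value
        elif label is not None:
            creds[label] = line
            label = None
    missing = [name for name in LABELS if name not in creds]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return tuple(creds[name] for name in LABELS)


# Example with an in-memory file standing in for the real credentials file
sample = io.StringIO(
    "Consumer token\n    aaa\n\nConsumer secret\n    bbb\n"
    "Access token\n    ccc\nAccess secret\n    ddd\n"
)
print(parse_credentials(sample))
```

The result can be passed on as Client(*parse_credentials(auth_file)).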

I have also tried without SPARQLWrapper, using just requests + requests_oauthlib. However, I get the same problem (HTML for a login page is returned), so it seems it might actually be a problem with the Wikimedia Commons Query Service.

import sys
import requests
from requests_oauthlib import OAuth1


def oauth_client(auth_file):
    creds = []
    for idx, line in enumerate(auth_file):
        if idx % 2 == 0:
            continue
        creds.append(line.strip())
    return OAuth1(*creds)


ENDPOINT = "https://wcqs-beta.wmflabs.org/sparql"
QUERY = """
SELECT ?file WHERE {
  ?file wdt:P180 wd:Q42 .
}
"""


r = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    auth=oauth_client(open(sys.argv[1])),
    headers={"Accept": "application/sparql-results+json"}
)


print(r.text)
logi-kal

4 Answers

3

Disclaimer: I'm one of the authors of WCQS (and the author of the apparently somewhat misleading article linked in the question).

That way of authenticating is used for apps authenticating with Wikimedia Commons (or any other Wikimedia app), but not with WCQS, which is itself an app authenticated with Wikimedia Commons. OAuth in this case is used strictly for a web app to authenticate users; currently, you're unable to authenticate using OAuth for bots and other applications. Any kind of usage will require user login.

This limitation comes from our current setup and infrastructure, and we plan to overcome it when we go into production (the service is currently released in beta status). Unfortunately, I can't tell you when that will happen, but it is important to us.

If you want to try out your bot before that happens, you can always log in via the browser and use the token in your code, but it is bound to expire at some point and the process will need to be repeated. A simple modification to your second listing does the trick:

import sys
import requests

ENDPOINT = "https://wcqs-beta.wmflabs.org/sparql"
QUERY = """
SELECT ?file WHERE {
  ?file wdt:P180 wd:Q42 .
}
"""

r = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json", "wcqsSession": "<token retrieved after logging in>"}
)


print(r.text)
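
Because a missing or expired token produces the login page rather than an error, it can help to check the response before trying to parse it. The following is a minimal sketch, not part of WCQS itself: the helper name is_sparql_json is hypothetical, and it assumes that a successful reply carries a JSON media type and a body that starts with an opening brace.

```python
def is_sparql_json(content_type, body):
    """Heuristically tell a SPARQL JSON result apart from an HTML login page."""
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ("application/sparql-results+json", "application/json"):
        return False
    # A JSON result document is an object, so it starts with "{"
    return body.lstrip().startswith("{")
```

With the listing above, this could be called as is_sparql_json(r.headers.get("Content-Type", ""), r.text) before calling r.json().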

Note that asking on the mailing list, directly on IRC (freenode:#wikimedia-discovery), or creating a Phabricator ticket is the best way of getting help with WCQS.

1

Why don't you try and see if you can get a SPARQL query answered "by hand", using requests + OAuth etc.? If you can, you'll know that there is a bug in SPARQLWrapper as opposed to an issue within your application code.

The requests code should look something like the following + OAuth stuff:


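import requests

# ENDPOINT and QUERY are as defined in the question; `auth` is the OAuth
# object, e.g. requests_oauthlib.OAuth1(...) built from your credentials.
r = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    auth=auth,
    headers={"Accept": "application/sparql-results+json"}
)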

Nick

Nicholas Car
0

If you're asking for the MediaWiki OAuth v1 authentication

I interpret this as meaning that you're looking for a way to do OAuth (v1) against a Wikimedia site alone, and that the rest of your code isn't really part of the question. Correct me if I'm wrong.

You don't specify what kind of application you're developing. There are different ways to authenticate against Wikimedia pages using OAuth; for web applications, both Flask and Django can be used with the correct back-end support.

A more "general" way is to use the mwoauth library (python-mwoauth), which works from any application. It is still supported on both Python 3 and Python 2.

I assume the following:

  • The target server has a MediaWiki installation with the OAuth Extension installed.
  • You want to perform an OAuth handshake with this server for authentication purposes.

Using Wikipedia.org as the example target platform:

$ pip install mwoauth

# Find a suitable place, depending on your app to include the authorization code:

from mwoauth import ConsumerToken, Handshaker
from six.moves import input # For compatibility between python 2 and 3

# Construct a "consumer" from the key/secret provided by the MediaWiki site
import config
consumer_token = ConsumerToken(config.consumer_key, config.consumer_secret)

# Construct handshaker with wiki URI and consumer
handshaker = Handshaker("https://en.wikipedia.org/w/index.php",
                        consumer_token)

# Step 1: Initialize -- ask MediaWiki for a temporary key/secret for user
redirect, request_token = handshaker.initiate()

# Step 2: Authorize -- send user to MediaWiki to confirm authorization
print("Point your browser to: %s" % redirect)
response_qs = input("Response query string: ")

# Step 3: Complete -- obtain authorized key/secret for "resource owner"
access_token = handshaker.complete(request_token, response_qs)
print(str(access_token))

# Step 4: Identify -- (optional) get identifying information about the user
identity = handshaker.identify(access_token)
print("Identified as {username}.".format(**identity))

# Fill in the other stuff :)

I may have misinterpreted your question altogether; if so, please shout to me through my left ear.

GitHub:

Use the Source, Luke

Here is a link to the docs, this includes an example using Flask: WikiMedia OAuth - Python

C. Sederqvist
  • This is to do with the Wikimedia Commons Query Service specifically, so I need to use a SPARQL client. It's similar to the Wikidata SPARQL endpoint, but has a different selection of data available. See the links in the question. Since this is just a script for personal use, I am using an owner-only token, i.e. no user interaction is needed, but the token can only run against one account. For this reason, no web frameworks need to be involved. It looks like mwoauth is a rather thin wrapper around oauthlib. I am using oauthlib directly since I need to integrate it with SPARQLWrapper. – Frankie Robertson Dec 18 '20 at 07:40
  • 1
    Ok, I see. Never used that, but that's an interesting task to get some experience with. Then I apologize for not digging deep enough into the question. Maybe point out that this is a requirement, so it is more likely you'll get the right answer. – C. Sederqvist Dec 18 '20 at 13:51
  • 1
    Then again if I read the actual question title, it is rather obvious. My bad. – C. Sederqvist Dec 18 '20 at 13:54
0

I would try running your code using a different endpoint. Instead of https://wcqs-beta.wmflabs.org/sparql try using https://query.wikidata.org/sparql. When I use the first endpoint I also get the HTML response of the login page that you were getting. However, when I use the second one I get the correct response:

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://query.wikidata.org/sparql"
sparql = SPARQLWrapper(endpoint)

# Example query to return a list of movies that Christian Bale has acted in:
query = """
SELECT ?film ?filmLabel (MAX(?pubDate) as ?latest_pubdate) WHERE {
   ?film wdt:P31 wd:Q11424 .
   ?film wdt:P577 ?pubDate .
   ?film wdt:P161 wd:Q45772 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
 }
GROUP BY ?film ?filmLabel
ORDER BY DESC(?latest_pubdate)
LIMIT 50
"""

sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Define a quick function to get json into pandas dataframe:
import pandas as pd
from pandas import json_normalize

def df_from_res(j):
    df = json_normalize(j['results']['bindings'])[['filmLabel.value','latest_pubdate.value']]
    df['latest_pubdate.value'] = pd.to_datetime(df['latest_pubdate.value']).dt.date
    return df

df_from_res(results).head(5)


#   filmLabel.value   latest_pubdate.value
# 0 Ford v Ferrari    2019-11-15
# 1 Vice              2019-02-21
# 2 Hostiles          2018-05-31
# 3 The Promise       2017-08-17
# 4 Song to Song      2017-05-25

And this endpoint also works with the requests library in a similar way:

import requests

payload = {'query': query, 'format': 'json'}

results = requests.get(endpoint, params=payload).json()
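
If pandas is not available, the same bindings can be flattened with plain Python. This is a minimal sketch that assumes the standard SPARQL JSON results layout (a results object holding a bindings list, where each variable maps to an object with a value key); the helper name flatten_bindings is hypothetical.

```python
def flatten_bindings(results):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        # Keep only the plain value of each bound variable in the row
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]
```

Calling flatten_bindings(results) on the response above would give one dictionary per row, keyed by variable names such as filmLabel and latest_pubdate.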
user6386471
  • 1
    Thanks for the recommendation, but that endpoint has different information available. In particular it doesn't have all the structured data on Wikimedia Commons available. See: https://diff.wikimedia.org/2020/10/29/sparql-in-the-shadow-of-structured-data-on-commons/ – Frankie Robertson Dec 23 '20 at 14:00
  • 1
    Ah yes, I see what you're going for now. I've had a quick play with trying to connect an app to a new account I made on https://en.wikipedia.beta.wmflabs.org/wiki/ to set up the credentials required, but couldn't find a way to do so like I've been able to previously on the non-beta site. Have you managed to get past that stage to get your credentials? (consumer_key, consumer_secret, access_token, access_secret). – user6386471 Dec 23 '20 at 20:10
  • Okay, I think you've found the solution. I got an account on the wrong wiki! I will accept your answer now, but please incorporate this information about where you need to sign up. ETA: Not tested yet but rushing to give you the bounty before it expires. ETA: To clarify, I was trying to use auth tokens I'd created on non-beta Wikimedia Commons. – Frankie Robertson Dec 24 '20 at 09:08
  • Nope. Doesn't work! Although I thought for sure that would do the trick. An owner-only token from https://meta.wikimedia.beta.wmflabs.org/wiki/Special:OAuthConsumerRegistration/propose?wpownerOnly=1 doesn't work for me either. – Frankie Robertson Dec 24 '20 at 09:25
  • Ah sorry about that! I've gone back and tried out a few different approaches to run [this query](https://wcqs-beta.wmflabs.org/#%23defaultView%3AImageGrid%0ASELECT%20%3Ffile%20%3Fimage%20WHERE%20%7B%0A%20%20%3Ffile%20wdt%3AP6243%20wd%3AQ179900%20.%20%0A%20%20%3Ffile%20schema%3AcontentUrl%20%3Furl%20.%0A%20%20%23%20workaround%20to%20show%20the%20images%20in%20an%20image%20grid%0A%20%20bind%28iri%28concat%28%22http%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FSpecial%3AFilePath%2F%22%2C%20wikibase%3AdecodeUri%28substr%28str%28%3Furl%29%2C53%29%29%29%29%20AS%20%3Fimage%29%0A%7D) programmatically... – user6386471 Dec 24 '20 at 13:09
  • When you run the query in the browser it works, but any attempt to hit the endpoint https://wcqs-beta.wmflabs.org/sparql fails. When in the browser, if you click on `> Code` at the bottom right of the query window, a pop-up comes up displaying code to programmatically run the query in various languages. The most information I could get when trying some of these is when running the Pywikibot code - I get an error message saying that the site does not provide a sparql endpoint. So this could potentially be the reason why all of our attempts are failing. – user6386471 Dec 24 '20 at 13:18
  • This seems strange as the endpoint is usable from the browser... I'll keep looking into it. – user6386471 Dec 24 '20 at 13:38
  • That code button is a good find. Given that it gives the same kind of non-working example code, I think we can start to conclude that the problem is probably that OAuth doesn't work with the endpoint, contrary to what the documentation states, and that it only works when a session cookie is present. I will try and chase this up so that either a documentation bug or an implementation bug is filed. – Frankie Robertson Dec 25 '20 at 08:10
  • Yep, I think that's a safe conclusion to make. It would be great to hear what the outcome is if you manage to chase it up! I'll keep an eye out for a solution too and will keep you posted if anything comes up. – user6386471 Dec 30 '20 at 23:27