The code below queries the Wikipedia API for pages in the "Physics" category and converts the response into a Python dictionary.

import ast
import requests

url = "https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&cmlimit=500&cmcontinue="
response = requests.get(url)
text = response.text
# Parse the response text into a Python dictionary.
data = ast.literal_eval(text)
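
As a hedged alternative to `ast.literal_eval`, the sketch below asks the API for JSON explicitly and parses it with the standard `json` module. The `format=json` parameter is not in the URL above and is added here as an assumption about what is wanted. The helper names (`build_category_query`, `titles_from_response`) and the trimmed sample body are made up for illustration, and no network request is made.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_category_query(category, limit=500):
    """Return a categorymembers query URL that requests JSON output."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",  # assumption: not present in the original URL
    }
    # urlencode percent-encodes each value, so ":" becomes "%3A".
    return f"{API_ENDPOINT}?{urlencode(params)}"


def titles_from_response(body):
    """Extract page titles from the JSON text of a categorymembers response."""
    data = json.loads(body)
    return [member["title"] for member in data["query"]["categorymembers"]]


# Hypothetical, trimmed response body built from the single result shown
# in the question; a real response contains many more entries.
sample_body = (
    '{"query": {"categorymembers": ['
    '{"pageid": 50724262, "ns": 0, "title": "Blasius\\u2013Chaplygin formula"}'
    "]}}"
)

query_url = build_category_query("Physics")
titles = titles_from_response(sample_body)
```

With a live request, `requests.get(query_url).json()` would give the same dictionary that `json.loads` produces here. Either way the JSON escape `\u2013` is decoded into a real en dash, so the title no longer contains a literal backslash.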

Here is one of the results returned by the Wikipedia API:

        {
            "pageid": 50724262,
            "ns": 0,
            "title": "Blasius\u2013Chaplygin formula"
        },

The title "Blasius\u2013Chaplygin formula" corresponds to the Wikipedia page https://en.wikipedia.org/wiki/Blasius–Chaplygin_formula.

I want to use the "title" to download pages from Wikipedia. I've replaced all spaces with underscores. But it's failing. I'm doing:

import requests
url = "https://en.wikipedia.org/wiki/Blasius\u2013Chaplygin_formula"
response = requests.get(url)

This gives me:

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://en.wikipedia.org/wiki/Blasius%5Cu2013Chaplygin_formula

How do I change the title Blasius\u2013Chaplygin formula into a URL that can be successfully called by requests?

When I tried to insert the Wikipedia link into this question on Stack Overflow, Stack Overflow automatically converted it to https://en.wikipedia.org/wiki/Blasius%E2%80%93Chaplygin_formula.

When I did:

import requests
url = "https://en.wikipedia.org/wiki/Blasius%E2%80%93Chaplygin_formula"
response = requests.get(url)

it was successful, so I want a library that will do a conversion like this that I can use in Python.

Martin Majlis

2 Answers

To make your life easier, you can use an existing wrapper around the Wikipedia API, such as Wikipedia-API.

import wikipediaapi
api = wikipediaapi.Wikipedia('en')

# it will shield you from URL encoding problems
p = api.page('Blasius\u2013Chaplygin formula')
print(p.summary)

# and it can make your code shorter
physics = api.page('Category:Physics')
for p in physics.categorymembers.values():
  print(f'[{p.title}]\t{p.summary}')
Martin Majlis

That "\u2013" is a Unicode escape sequence. Python automatically turns it into an en dash, but you can't put a raw en dash in a Wikipedia URL, so you have to URL-encode it (as %E2%80%93), which is what Stack Overflow did for you earlier.
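
To see where `%E2%80%93` comes from, here is a small sketch using only the standard library. The variable names are illustrative. It shows that the escape is a single character, that its UTF-8 encoding is three bytes, and that percent-encoding writes out each of those bytes.

```python
from urllib.parse import quote, unquote

dash = "\u2013"                     # Python reads the escape as one character
utf8_bytes = dash.encode("utf-8")   # the en dash takes three bytes in UTF-8
encoded = quote(dash)               # each byte becomes a %XX triplet

print(len(dash), utf8_bytes.hex(), encoded)
# prints: 1 e28093 %E2%80%93
```

`unquote(encoded)` reverses the step and gives back the en dash.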

You can do it yourself by using something like this:

import requests
import urllib.parse

# Percent-encode the page title before appending it to the base URL.
title = "Blasius\u2013Chaplygin_formula"
response = requests.get("https://en.wikipedia.org/wiki/" + urllib.parse.quote(title))
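
If you need this for every title returned by the API, a small helper might look like the sketch below. The names `title_to_url` and `decode_literal_escapes` are hypothetical. The second one is only needed on the assumption that your titles still contain a literal backslash followed by `u2013`, which is what the `%5C` in the 404 URL suggests. It handles plain four-digit `\uXXXX` escapes only, not surrogate pairs.

```python
import re
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def decode_literal_escapes(title):
    """Turn literal backslash-u sequences (backslash, 'u', four hex digits)
    into the characters they stand for."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        title,
    )


def title_to_url(title):
    """Build a page URL from an API title: decode escapes, use underscores
    instead of spaces, then percent-encode the result."""
    clean = decode_literal_escapes(title).replace(" ", "_")
    return WIKI_BASE + quote(clean)


# Both a properly decoded title and one with a literal backslash
# end up at the same URL.
decoded_url = title_to_url("Blasius\u2013Chaplygin formula")
literal_url = title_to_url("Blasius\\u2013Chaplygin formula")

# Without decoding, the backslash itself is percent-encoded as %5C,
# which matches the URL in the 404 error.
broken_path = quote("Blasius\\u2013Chaplygin_formula")
```

The resulting URL can then be passed to `requests.get` as in the snippet above.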

See also: How to urlencode a querystring in Python?

Turtvaiz