38

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki: one is the API and the other is a database dump.

I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even if I could, it would take more than 4 million requests, which would probably get me blocked from any further requests anyway.

So my questions are:

  1. Is there a way to obtain only the titles of Wikipedia articles via the API?
  2. Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?
John Strood
Flavio
  • You could try the [API Sandbox](http://en.wikipedia.org/wiki/Special%3aApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow) or an actual [query](http://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json) – chridam Jun 29 '14 at 08:22

3 Answers

56

The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
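
For example, a single batch of titles can be fetched like this (a rough sketch; the requests library is my assumption here, any HTTP client works):

# fetch one batch of up to 500 article titles via list=allpages
import requests

params = {
    "action": "query",
    "list": "allpages",
    "aplimit": "max",    # 500 titles per request for normal users
    "apnamespace": 0,    # namespace 0 = articles
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
titles = [page["title"] for page in resp["query"]["allpages"]]
print(len(titles), titles[:3])
# when present, resp["continue"]["apcontinue"] is the value to send back
# (as apcontinue) to get the next batch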

But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
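
The dump is just one title per line (with underscores instead of spaces), so reading it is straightforward; a rough sketch, assuming the usual "latest" dump URL layout on dumps.wikimedia.org:

# download and read the all-titles-in-ns0 dump (one title per line)
import gzip
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz"
urllib.request.urlretrieve(DUMP_URL, "enwiki-all-titles-in-ns0.gz")

with gzip.open("enwiki-all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
    titles = [line.rstrip("\n") for line in f]

print(len(titles))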

svick
  • Awesome, thanks a lot! I was looking for exactly such a dump but wasn't able to find one. I guess one click further in searching for a dump would have brought me to exactly this download :) thanks! – Flavio Jun 29 '14 at 08:52
  • This helped us. Can you give a link to the page which has the list of all dumps? – Vivek Sancheti Nov 08 '16 at 10:42
  • @VivekSancheti [Here is the page listing all English Wikipedia dumps from last month.](https://dumps.wikimedia.org/enwiki/20161020/) – svick Nov 08 '16 at 14:18
  • What is the difference between the in-ns0 and the non-in-ns0 .gz file? They differ in size as well. – zwep Sep 06 '17 at 11:07
  • @zwep The difference is that "in-ns0" only contains information about pages in namespace 0, that is articles. – svick Sep 06 '17 at 11:53
  • Thanks @svick! Is it possible to get the same data dump for other languages? And also, is there a way to link the titles of the same article in different languages together (via article ID, for example)? – Lun Oct 17 '18 at 08:02
3

Right now, as per the current statistics, the number of articles is around 5.8M. To get the list of pages I used the AllPages API, restricting myself to namespace 0. However, the number of pages I get is around 14.5M, which is about 3 times what I was expecting. Following is the sample code I am using:

# get the list of all wikipedia pages (articles) -- English
from simplemediawiki import MediaWiki

listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")

wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'     # up to 500 titles per request for normal users
requestObj['apnamespace'] = '0'   # namespace 0 = articles

# first batch
pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']

for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)

numQueries = 1

# keep requesting while the API returns a continuation token;
# the final batch carries no 'continue' key, so the loop stops there
while 'continue' in pagelist:

    requestObj['apcontinue'] = pagelist["continue"]["apcontinue"]
    pagelist = wiki.call(requestObj)

    pagesInQuery = pagelist['query']['allpages']

    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)
        # print writestr

    numQueries += 1

    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries

listOfPagesFile.close()

The number of queries fired is around 28,900, which results in approx. 14.5M page names.

I also tried the all-titles link mentioned in the answer above. In that case as well, I get around 14.5M pages.

I thought that this overestimate relative to the actual number of pages was because of redirects, so I added the 'nonredirects' option to the request object:

requestObj['apfilterredir'] = 'nonredirects'

After doing that I get only 112,340 pages, which is far too small compared to 5.8M.

With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.

Is there any other option that I should be trying to get the actual (~5.8M) set of page names?

jayesh
  • Is simplemediawiki Python 3 or is it Python 2? – Mark Olson Jan 06 '20 at 15:51
  • If you are getting an error about the print statement, you can avoid that by installing directly from GitHub rather than PyPI: `pip install git+https://github.com/iliana/python-simplemediawiki.git` – Shayan RC Mar 13 '21 at 09:01
-1

Here is an asynchronous program that will generate MediaWiki page titles:

import json

# 'log' and 'get' are assumed to be provided by the surrounding program:
# 'get' must perform an asynchronous HTTP GET and return the response body
# as text (or None on failure); a possible definition is sketched below.

async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki.rstrip("/"))
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",  # skip redirect pages
        "aplimit": "max",                 # up to 500 titles per request
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue  # retry the same request on failure
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]

        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            # no continuation token means the last batch has been reached
            return
        else:
            # for list=allpages the continuation value is a title,
            # so it can be fed back via apfrom
            params["apfrom"] = apcontinue
amirouche