
I want to get Wikipedia pages as text.

I looked at the Wikipedia API page at https://en.wikipedia.org/w/api.php, which says that to get a page as text I need to append this to the page address:

api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt

However, when I try appending this suffix to a normal page's address, the page is not found:

https://en.wikipedia.org/wiki/George_Washington/api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt

Following the instructions from Get Text Content from mediawiki page via API, I then tried appending /api.php?action=parse&page=test to the page address instead, which gave me this:

https://en.wikipedia.org/wiki/George_Washington/api.php?action=parse&page=test

However, this doesn't work either.

bsky
  • Possible duplicate of [Get Text Content from mediawiki page via API](http://stackoverflow.com/questions/1625162/get-text-content-from-mediawiki-page-via-api) – Zulu Nov 21 '15 at 14:04
  • Sorry to ask, but did you actually read the instructions you linked to? – leo Nov 22 '15 at 19:20

2 Answers


NB: All these examples are CORS-enabled (note the origin=* parameter).


Text only

Using the precise title, as it appears in the Wikipedia page URL:

https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&titles=Sokolsky_Opening&format=json
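As a rough guide to what each parameter in the URL above does, here is a minimal sketch that builds the same kind of request with URLSearchParams. The helper name buildExtractUrl is made up for illustration, and the comments reflect my reading of the API help pages rather than an authoritative description. The original URL passes explaintext as a bare flag; the sketch sends explaintext=1, on the assumption that the API treats any present boolean parameter as true.

```javascript
// Hypothetical helper (not part of any library): builds the plain-text
// extract URL for a page title.
function buildExtractUrl(title) {
  const params = new URLSearchParams({
    action: "query",   // use the "query" module of the API
    origin: "*",       // allow anonymous cross-origin (CORS) requests
    prop: "extracts",  // ask for page extracts (TextExtracts extension)
    explaintext: "1",  // return plain text instead of limited HTML
    titles: title,     // page title as it appears in the page URL
    format: "json",    // response format
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

console.log(buildExtractUrl("Sokolsky_Opening"));
```

Dropping the explaintext entry should give the limited-HTML variant of the same request.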


Search relevant pages by keywords

This returns page IDs, precise titles (and therefore URLs), and a short text extract for each match:

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=max&format=json&exsentences=1&origin=*&exintro=&explaintext=&generator=search&gsrlimit=23&gsrsearch=chess
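The JSON that comes back from a search like this is keyed by page ID, so it usually needs flattening before use. The sketch below assumes each page object carries pageid, title, extract, and an index field giving the search rank; listSearchResults is a made-up name, and the sample object is hand-written to mimic the shape, not real API output.

```javascript
// Hypothetical helper: turns a generator=search response into a ranked list.
function listSearchResults(data) {
  const pages = (data.query && data.query.pages) || {};
  return Object.values(pages)
    // "index" is assumed to hold the search rank; fall back to 0 if absent.
    .sort((a, b) => (a.index ?? 0) - (b.index ?? 0))
    .map((page) => ({
      id: page.pageid,
      title: page.title,
      extract: page.extract ?? "",
    }));
}

// Hand-written sample in the assumed response shape (not real API output).
const sampleSearchResponse = {
  query: {
    pages: {
      "100017": { pageid: 100017, title: "Sokolsky Opening", index: 2, extract: "..." },
      "5134": { pageid: 5134, title: "Chess", index: 1, extract: "..." },
    },
  },
};

console.log(listSearchResults(sampleSearchResponse));
```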


Wiki page ID

Using the precise title:

https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=pageprops&format=json&titles=Sokolsky_Opening
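To pull the ID out of that response, something like the following may be enough. It assumes the default JSON layout, where query.pages is an object keyed by page ID and a page that does not exist is reported with a missing property; pageIdFromResponse is a hypothetical name.

```javascript
// Hypothetical helper: returns the numeric page ID, or null if the title
// was not found.
function pageIdFromResponse(data) {
  const pages = Object.values(data.query?.pages ?? {});
  const page = pages[0];
  // Assumption: a nonexistent title comes back with a "missing" property
  // and no pageid.
  if (!page || "missing" in page || page.pageid === undefined) {
    return null;
  }
  return page.pageid;
}

// Hand-written sample in the assumed response shape (not real API output).
console.log(
  pageIdFromResponse({
    query: { pages: { "100017": { pageid: 100017, title: "Sokolsky Opening" } } },
  })
);
```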


Full html

By wiki page ID, includes the Wikitext:

https://en.wikipedia.org/w/api.php?action=parse&origin=*&format=json&pageid=100017
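Reading the HTML out of the parse response could look like the sketch below. It assumes the default JSON layout, where the HTML sits under parse.text["*"], and also accepts a plain string in case a newer format version is requested; htmlFromParse is a made-up name.

```javascript
// Hypothetical helper: extracts the rendered HTML from an action=parse response.
function htmlFromParse(data) {
  // The API reports failures as { error: { code, info } }.
  if (data.error) {
    throw new Error(data.error.info);
  }
  const text = data.parse.text;
  // Assumption: older layout wraps the HTML in { "*": "..." },
  // newer layout returns the string directly.
  return typeof text === "string" ? text : text["*"];
}

// Hand-written sample in the assumed response shape (not real API output).
console.log(htmlFromParse({ parse: { pageid: 100017, text: { "*": "<p>...</p>" } } }));
```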


Stripped html

A lighter HTML version, without the wikitext:

https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&format=json&titles=Sokolsky_Opening


Cross origin:

With CORS requests, it can sometimes take two API calls to jump between a page ID and a page title.
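A minimal sketch of that two-call pattern, going from a title to its ID and then to the HTML, is shown here. The names apiGet and fetchHtmlByTitle are invented for this example, the fetchImpl argument exists only so the network call can be swapped out, and the response shapes are assumed to match the default JSON layout.

```javascript
const API_URL = "https://en.wikipedia.org/w/api.php";

// Hypothetical helper: performs one CORS-enabled GET against the API.
async function apiGet(params, fetchImpl = fetch) {
  const query = new URLSearchParams({ origin: "*", format: "json", ...params });
  const response = await fetchImpl(`${API_URL}?${query}`);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  return response.json();
}

// Hypothetical helper: resolves a title to its page ID, then fetches the HTML.
async function fetchHtmlByTitle(title, fetchImpl = fetch) {
  // Call 1: title -> page ID.
  const lookup = await apiGet({ action: "query", titles: title }, fetchImpl);
  const page = Object.values(lookup.query.pages)[0];
  if (page.pageid === undefined) {
    throw new Error(`Page not found: ${title}`);
  }
  // Call 2: page ID -> rendered HTML (assumed to be under parse.text["*"]).
  const parsed = await apiGet(
    { action: "parse", pageid: String(page.pageid) },
    fetchImpl
  );
  return parsed.parse.text["*"];
}

// Usage (performs real network requests):
// fetchHtmlByTitle("Sokolsky_Opening").then((html) => console.log(html.length));
```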

In an HTTPS context, we can use fetch to embed wiki text anywhere.

Example fetching remote JSON:

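// Fetch the plain-text extract and show it in the <pre id="main"> element below.
fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&format=json&titles=Sokolsky_Opening")
  .then(response => response.json())
  .then(data => {
    // Results are keyed by page ID (100017 for this page); take the first entry.
    const page = Object.values(data.query.pages)[0];
    document.getElementById("main").textContent = page.extract;
  });
<pre id="main" style="white-space: pre-wrap"></pre>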

⚠️ This API has some quirks: among other things, pages with heavy content sometimes get truncated, and requests may be rate limited.


Good luck.


NVRM
  • can you please add what each parameter does exactly? – medic17 Jul 31 '20 at 02:37
  • The API documentation is a little bit of a mess; you have to modify the URL. For example, this is the doc for `action=query`: https://mediawiki.org/w/api.php?action=help&modules=query – NVRM Jul 31 '20 at 02:43
  • yea I know it's a mess, that's why I'm here :) I was looking for how exactly your first example ended up as mostly plaintext with most of the markup removed. It seems that's the `prop=extracts&explaintext`: `extracts` gives minimal HTML and `explaintext` removes the HTML from that – medic17 Jul 31 '20 at 03:12
  • These are the only notes I have, but it looks like this: https://www.mediawiki.org/wiki/Extension:TextExtracts ? Otherwise look here, all params are detailed: https://en.wikipedia.org/wiki/Special:ApiSandbox – NVRM Jul 31 '20 at 04:56

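You have to use one of these formats: json, jsonfm, none, php, phpfm, rawfm, xml or xmlfm, so txt is not a valid format. Your API link is also wrong: api.php lives at /w/api.php, not under the article's /wiki/ address, and the page is named with the titles parameter. Use this: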

https://en.wikipedia.org/w/api.php?action=query&titles=George_Washington&prop=revisions&rvprop=content&format=xml
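For illustration, the same request can be assembled in JavaScript while rejecting unsupported formats up front. This is only a sketch: buildRevisionsUrl and wikitextFromRevisions are invented names, and the second function assumes that, with format=json, the wikitext is found under revisions[0]["*"] (or under slots.main["*"] when slots are requested).

```javascript
// Formats listed in the answer above.
const VALID_FORMATS = ["json", "jsonfm", "none", "php", "phpfm", "rawfm", "xml", "xmlfm"];

// Hypothetical helper: builds the revisions URL, refusing formats such as "txt".
function buildRevisionsUrl(title, format = "json") {
  if (!VALID_FORMATS.includes(format)) {
    throw new Error(`Unsupported format: ${format}`);
  }
  const params = new URLSearchParams({
    action: "query",
    titles: title,
    prop: "revisions",
    rvprop: "content",
    format,
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

// Hypothetical helper: reads the wikitext from a format=json response.
function wikitextFromRevisions(data) {
  const page = Object.values(data.query.pages)[0];
  const revision = page.revisions[0];
  // Assumption: content is under "*" directly, or under slots.main["*"].
  return revision.slots ? revision.slots.main["*"] : revision["*"];
}

console.log(buildRevisionsUrl("George_Washington", "xml"));
```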
Termininja