Is there a Wikipedia API just for retrieving the content summary?

Question

I just need to retrieve the first paragraph of a Wikipedia page.

Content must be HTML formatted, ready to be displayed on my website (so no BBCode, or Wikipedia special code!)

Wikipedia doesn't use BB code, it uses its own wiki markup code. — svick, Dec 19 '11 at 09:53
It doesn't work for every wikipedia article. https://ro.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=FC+Barcelona&exintro=1&explaintext=1&exsectionformat=plain — dumitru, Apr 20 '17 at 04:57

score 244 · Accepted Answer · edited Aug 13 '21 at 05:58

There's a way to get the entire "introduction section" without any HTML parsing! Similar to AnthonyS's answer with an additional explaintext parameter, you can get the introduction section text in plain text.

Query

Getting Stack Overflow's introduction in plain text:

Using the page title:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

Or use pageids:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040

JSON Response

(warnings stripped)

{
    "query": {
        "pages": {
            "21721040": {
                "pageid": 21721040,
                "ns": 0,
                "title": "Stack Overflow",
                "extract": "Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky, as a more open alternative to earlier Q&A sites such as Experts Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.\nIt features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Digg. Users of Stack Overflow can earn reputation points and \"badges\"; for example, a person is awarded 10 reputation points for receiving an \"up\" vote on an answer given to a question, and can receive badges for their valued contributions, which represents a kind of gamification of the traditional Q&A site or forum. All user-generated content is licensed under a Creative Commons Attribute-ShareAlike license. Questions are closed in order to allow low quality questions to improve. Jeff Atwood stated in 2010 that duplicate questions are not seen as a problem but rather they constitute an advantage if such additional questions drive extra traffic to the site by multiplying relevant keyword hits in search engines.\nAs of April 2014, Stack Overflow has over 2,700,000 registered users and more than 7,100,000 questions. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML."
            }
        }
    }
}

Documentation: API: query/prop=extracts
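A comment on this answer asks how to read "extract" when the page ID is not known in advance. The following is a minimal JavaScript sketch of one way to do that, assuming the response shape shown above; the helper names buildExtractUrl and firstExtract and the lang parameter are made up for illustration and are not part of the API.

```javascript
// Build the extracts query URL from this answer.
// `lang` is an illustrative parameter for picking the wiki subdomain.
function buildExtractUrl(title, lang = "en") {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exintro: "1",
    explaintext: "1",
    redirects: "1",
    titles: title,
  });
  return `https://${lang}.wikipedia.org/w/api.php?${params}`;
}

// Return the extract of the first page in the response without needing
// to know its page ID; returns null if there is no extract.
function firstExtract(response) {
  const pages = (response.query && response.query.pages) || {};
  const page = Object.values(pages)[0];
  return page && typeof page.extract === "string" ? page.extract : null;
}

// Possible usage (performs a network request, so it is left commented out):
// fetch(buildExtractUrl("Stack Overflow"))
//   .then((res) => res.json())
//   .then((json) => console.log(firstExtract(json)));
```

Taking the first value of "pages" works because a single-title query returns a single entry; with several titles you would iterate over all values instead.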

It is very recommendable to use **&redirects=1** which redirects automatically to content of synonyms — joecks, Feb 13 '16 at 14:42
How can I get information from this JSON response if I don't know pages number. I can't access JSON array containing "extract" — Laurynas G, Mar 10 '16 at 22:35
@LaurynasG You can cast the object to an array and then grab it like this: $extract = current((array)$json_query->query->pages)->extract — MarcGuay, Mar 15 '16 at 21:57
@LaurynasG, @MarcGuay You can also add [indexpageids](https://www.mediawiki.org/wiki/API:Query#Getting_a_list_of_page_IDs) as a parameter to the URL to get a list of pageids for easier iteration. — Rami, Mar 29 '16 at 20:52
I got the JSON output from the wiki call and then cast the JSON to an array with $data = json_decode($json, true). Then I tried to get the 'extract' using `$extract = current((array)$data->query->pages)->extract;`, but "Notice: Trying to get property of non-object" keeps on coming. — shikhar bansal, Jun 03 '16 at 10:37
Is there a way to make the same request using the pageid instead of the title? — cglacet, Aug 10 '20 at 09:02
@cglacet yup. Just use the `pageids=` query parameter like so `https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040` — Mike Rapadas, Aug 12 '20 at 20:48
One more question... on e.g. openstreetmap i get a wikipedia entry e.g. like: `de:Stüdlhütte`. How would I need to put that into the query for it to return me a proper result? — Georg, Sep 23 '21 at 12:38

score 87 · Answer 2 · edited Jun 02 '22 at 09:12

There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose.

Extracts allow you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (no additional assets like images or infoboxes). You can also retrieve extracts with finer granularity such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Please note that, if you want the first paragraph specifically, you still need to do some additional parsing as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than some of the other API queries suggested, because you don't have additional assets such as images in the API response to parse.

Caveat from the docs:

We do not recommend the usage of exsentences. It does not work for HTML extracts and there are many edge cases for which it doesn't exist. For example "Arm. gen. Ing. John Smith was a soldier." will be treated as 4 sentences. We do not plan to fix this.
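The "additional parsing" needed to isolate the first paragraph from an HTML extract could look roughly like the sketch below. It is regex-based and only meant for the simple markup of an intro extract; a real HTML parser would be more robust. The function name firstParagraphHtml is hypothetical, and the assumption that an extract may start with an empty paragraph element is mine, not something stated in this answer.

```javascript
// Return the first non-empty <p>...</p> element of an HTML extract,
// or null if there is none. Empty paragraphs are skipped.
function firstParagraphHtml(html) {
  const paragraphs = html.match(/<p\b[^>]*>[\s\S]*?<\/p>/gi) || [];
  for (const p of paragraphs) {
    // Strip tags to check whether the paragraph has any visible text.
    const text = p.replace(/<[^>]*>/g, "").trim();
    if (text.length > 0) return p;
  }
  return null;
}
```

The returned string keeps its wrapping p tag and inline markup, so it can be inserted into a page as HTML.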

The first link is (effectively) broken. There isn't "extracts" or "extract" on that page. — Peter Mortensen, Aug 13 '21 at 05:52

lw1.at · Answer 3 · 2023-02-06T13:41:03.420

Since 2017 Wikipedia provides a REST API with better caching. In the documentation you can find the following API which perfectly fits your use case (as it is used by the new Page Previews feature).

https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow returns the following data which can be used to display a summary with a small thumbnail:

{
  "type": "standard",
  "title": "Stack Overflow",
  "displaytitle": "<span class=\"mw-page-title-main\">Stack Overflow</span>",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q549037",
  "titles": {
    "canonical": "Stack_Overflow",
    "normalized": "Stack Overflow",
    "display": "<span class=\"mw-page-title-main\">Stack Overflow</span>"
  },
  "pageid": 21721040,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/StackOverflow.com_Top_Questions_Page_Screenshot.png/320px-StackOverflow.com_Top_Questions_Page_Screenshot.png",
    "width": 320,
    "height": 144
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/a/a5/StackOverflow.com_Top_Questions_Page_Screenshot.png",
    "width": 1920,
    "height": 865
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "1136271608",
  "tid": "a5580980-9fe9-11ed-8bcd-ff7b011c142c",
  "timestamp": "2023-01-29T15:28:54Z",
  "description": "Website hosting questions and answers on a wide range of topics in computer programming",
  "description_source": "local",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.wikipedia.org/wiki/Stack_Overflow?action=history",
      "edit": "https://en.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Stack_Overflow"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Stack_Overflow",
      "edit": "https://en.m.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Stack_Overflow"
    }
  },
  "extract": "Stack Overflow is a question and answer website for professional and enthusiast programmers. It is the flagship site of the Stack Exchange Network. It was created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer websites such as Experts-Exchange. Stack Overflow was sold to Prosus, a Netherlands-based consumer internet conglomerate, on 2 June 2021 for $1.8 billion.",
  "extract_html": "<p><b>Stack Overflow</b> is a question and answer website for professional and enthusiast programmers. It is the flagship site of the Stack Exchange Network. It was created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer websites such as Experts-Exchange. Stack Overflow was sold to Prosus, a Netherlands-based consumer internet conglomerate, on 2 June 2021 for $1.8 billion.</p>"
}

By default, it follows redirects (so that /api/rest_v1/page/summary/StackOverflow also works), but this can be disabled with ?redirect=false.

If you need to access the API from another domain you can set the CORS header with &origin= (e.g., &origin=*).

As of 2019: The API seems to return more useful information about the page.
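As a rough sketch of how this endpoint could be used from JavaScript: the helpers below build the summary URL and pick the HTML extract out of a response shaped like the one above. The names buildSummaryUrl and summaryToHtml and the lang parameter are hypothetical, and the "disambiguation" value for "type" is assumed from the comment below rather than taken from the documentation.

```javascript
// Build the REST summary URL for a title.
// `lang` is an illustrative parameter for picking the wiki subdomain.
function buildSummaryUrl(title, lang = "en") {
  // Page paths use underscores for spaces; the rest is percent-encoded.
  const slug = encodeURIComponent(title.trim().replace(/ /g, "_"));
  return `https://${lang}.wikipedia.org/api/rest_v1/page/summary/${slug}`;
}

// Return the HTML extract, or null for disambiguation pages
// ("disambiguation" is an assumed value of the "type" field).
function summaryToHtml(summary) {
  if (summary.type === "disambiguation") return null;
  return summary.extract_html || null;
}

// Possible usage (performs a network request, so it is left commented out):
// fetch(buildSummaryUrl("Stack Overflow"))
//   .then((res) => res.json())
//   .then((json) => console.log(summaryToHtml(json)));
```

Since "extract_html" is already HTML, this covers the question's requirement of content that is ready to display.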

This also includes "type" which is excellent if you need to know if what you searched has a "disambiguation". — Jeel Shah, May 19 '18 at 23:50
I am getting CORS error while trying to access this link from my Angular based application can anyone tell me how to resolve that. — Praveen Ojha, May 31 '18 at 01:53
Is it possible to also query by a wikidata ID? I have some json data I extratcted which looks like `"other_tags" : "\"addr:country\"=>\"CW\",\"historic\"=>\"ruins\",\"name:nl\"=>\"Riffort\",\"wikidata\"=>\"Q4563360\",\"wikipedia\"=>\"nl:Riffort\""` Can we get the extract now by the QID? — Sourav Chatterjee, Feb 24 '19 at 04:45
What @SouravChatterjee asked for, can this API be used to search by page ids? Seems not — Abhijit Sarkar, Jun 19 '20 at 01:06

score 39 · Answer 4 · edited Aug 13 '21 at 06:00

This code allows you to retrieve the content of the first paragraph of the page in plain text.

Parts of this answer come from here and thus here. See MediaWiki API documentation for more information.

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in JSON format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'https://en.wikipedia.org/w/api.php?format=json&action=parse&page=Baseball&prop=text&section=0'; // https: the http URL answers with a redirect, which cURL does not follow by default
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // Get the main text content of the query (it's parsed HTML)

// Pattern for first match of a paragraph
$pattern = '#<p>(.*)</p>#Us'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match($pattern, $content, $matches))
{
    // print $matches[0]; // Content of the first paragraph (including wrapping <p> tag)
    print strip_tags($matches[1]); // Content of the first paragraph without the HTML tags.
}

But if you search "coral", the result will be something not required. Is there any other way, so that only the p tags with smmary can be picked up — Deepanshu Goyal, Dec 06 '13 at 09:48

svick · Answer 5 · 2011-12-19T09:55:13.933

Yes, there is. For example, if you wanted to get the content of the first section of the article Stack Overflow, use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

The parts mean this:

format=xml: Return the result formatter as XML. Other options (like JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.
action=query&prop=revisions: Get information about the revisions of the page. Since we don't specify which revision, the latest one is used.
titles=Stack%20Overflow: Get information about the page Stack Overflow. It's possible to get the text of more pages in one go, if you separate their names by |.
rvprop=content: Return the content (or text) of the revision.
rvsection=0: Return only content from section 0.
rvparse: Return the content parsed as HTML.

Keep in mind that this returns the whole first section including things like hatnotes (“For other uses …”), infoboxes or images.

There are several libraries available for various languages that make working with API easier, it may be better for you if you used one of them.

I dont want the content parsed ad HTML, i just want to get the "plain text" (neither wikipedia code) — sparkle, Jan 12 '12 at 15:43
The API doesn't offer anything like that. And I can understand why: because from the API's perspective, it's not clear what exactly should this "plain text" contain. For example, how should it represent tables, whether to include "[citation needed]", navigational boxes or image descriptions. — svick, Jan 12 '12 at 16:52
Adding `&redirects=true` to the end of the link ensures you get to the destination article, if one exists. — eric.mitchell, Feb 18 '14 at 03:54

score 14 · Answer 6 · edited Aug 13 '21 at 05:56

This is the code I'm using right now for a website I'm making that needs to get the leading paragraphs, summary, and section 0 of Wikipedia articles, and it's all done within the browser (client-side JavaScript) thanks to the magic of JSONP: http://jsfiddle.net/gautamadude/HMJJg/1/

It uses the Wikipedia API to get the leading paragraphs (called section 0) in HTML like so: http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Stack_Overflow&prop=text&section=0&callback=?

It then strips the HTML and other undesired data, giving you a clean string of an article summary. If you want you can, with a little tweaking, get a "p" HTML tag around the leading paragraphs, but right now there is just a newline character between them.

Code:

var url = "http://en.wikipedia.org/wiki/Stack_Overflow";
var title = url.split("/").slice(4).join("/");

// Get leading paragraphs (section 0)
$.getJSON("https://en.wikipedia.org/w/api.php?format=json&action=parse&page=" + title + "&prop=text&section=0&callback=?", function (data) {
    for (text in data.parse.text) {
        var text = data.parse.text[text].split("<p>");
        var pText = "";

        for (p in text) {
            // Remove HTML comment
            text[p] = text[p].split("<!--");
            if (text[p].length > 1) {
                text[p][0] = text[p][0].split(/\r\n|\r|\n/);
                text[p][0] = text[p][0][0];
                text[p][0] += "</p> ";
            }
            text[p] = text[p][0];

            // Construct a string from paragraphs
            if (text[p].indexOf("</p>") == text[p].length - 5) {
                var htmlStrip = text[p].replace(/<(?:.|\n)*?>/gm, '') // Remove HTML
                var splitNewline = htmlStrip.split(/\r\n|\r|\n/); //Split on newlines
                for (newline in splitNewline) {
                    if (splitNewline[newline].substring(0, 11) != "Cite error:") {
                        pText += splitNewline[newline];
                        pText += "\n";
                    }
                }
            }
        }
        pText = pText.substring(0, pText.length - 2); // Remove extra newline
        pText = pText.replace(/\[\d+\]/g, ""); // Remove reference tags (e.g. [1], [4], etc.)
        document.getElementById('textarea').value = pText
        document.getElementById('div_text').textContent = pText
    }
});

Do you add this to the client-side script? If so, isn't that XSS? — craig, Jul 28 '14 at 13:21
It has lot of bugs, try this link with your script : https://en.wikipedia.org/wiki/Modular_Advanced_Armed_Robotic_System — rohankvashisht, Sep 01 '16 at 19:48

score 8 · Answer 7 · edited Aug 13 '21 at 06:01

This URL will return a summary in XML format.

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Agra&MaxHits=1

I have created a function to fetch the description of a keyword from Wikipedia.

function getDescription($keyword) {
    $url = 'http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=' . urlencode($keyword) . '&MaxHits=1';
    $xml = simplexml_load_file($url);
    return $xml->Result->Description;
}

echo getDescription('agra');

score 7 · Answer 8 · edited Aug 13 '21 at 05:43

You can also get content such as the first paragraph via DBPedia which takes Wikipedia content and creates structured information from it (RDF) and makes this available via an API. The DBPedia API is a SPARQL one (RDF-based), but it outputs JSON and it is pretty easy to wrap.

As an example here's a super simple JavaScript library named WikipediaJS that can extract structured content including a summary first paragraph.

You can read more about it in this blog post: WikipediaJS - accessing Wikipedia article data through Javascript

The JavaScript library code can be found in wikipedia.js.

score 2 · Answer 9 · answered Dec 18 '11 at 22:35

The abstract.xml.gz dump sounds like the one you want.

sarnold

score 1 · Answer 10 · edited Aug 13 '21 at 05:55

My approach was as follows (in PHP):

$url = "whatever_you_need";

$html = file_get_contents('https://en.wikipedia.org/w/api.php?action=opensearch&search=' . urlencode($url));
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');

$utf8html might need further cleaning, but that's basically it.

Alex · answered Dec 16 '15 at 13:40 · edited by Peter Mortensen, Aug 13 '21 at 05:55

It is better to ask for UTF-8 from the API with &utf8= — TomoMiha, Nov 19 '16 at 16:31

score 1 · Answer 11 · edited Aug 13 '21 at 06:06

I tried Michael Rapadas' and @Krinkle's solutions, but in my case I had trouble finding some articles depending on the capitalization. Like here:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&exsentences=1&explaintext=&titles=Led%20zeppelin

Note I truncated the response with exsentences=1

Apparently "title normalization" was not working correctly:

Title normalization converts page titles to their canonical form. This means capitalizing the first character, replacing underscores with spaces, and changing namespace to the localized form defined for that wiki. Title normalization is done automatically, regardless of which query modules are used. However, any trailing line breaks in page titles (\n) will cause odd behavior and they should be stripped out first.
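When normalization or a redirect does apply, the action=query response reports it in the "normalized" and "redirects" arrays (the latter only when the redirects parameter is set), each entry holding a "from" and a "to" title. The sketch below follows those mappings to find the title that was finally used; resolveTitle is a hypothetical helper name.

```javascript
// Follow the "normalized" and then the "redirects" mappings of an
// action=query response to find the title the API ended up using.
function resolveTitle(response, requested) {
  const query = response.query || {};
  let title = requested;
  for (const step of [query.normalized, query.redirects]) {
    const hit = (step || []).find((entry) => entry.from === title);
    if (hit) title = hit.to;
  }
  return title;
}
```

If neither array mentions the requested title, it is returned unchanged, which is what happens when the capitalization already matches or no redirect exists.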

I know I could have sorted out the capitalization issue easily, but there was also the inconvenience of having to cast the object to an array.

Because I really just wanted the very first paragraph of a well-known and well-defined search (with no risk of fetching info from other articles), I did it like this:

https://en.wikipedia.org/w/api.php?action=opensearch&search=led%20zeppelin&limit=1&format=json

Note in this case I did the truncation with limit=1

This way:

I can access the response data very easily.
The response is quite small.

But we still have to be careful with the capitalization of our search.

More information: https://www.mediawiki.org/wiki/API:Opensearch
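To illustrate why the response is easy to access: an opensearch JSON response is a four-element array holding the search term, then parallel arrays of titles, descriptions, and URLs. The sketch below builds the query and zips those arrays into objects. The function names are hypothetical, and I would not rely on the description strings always being filled in; that is an assumption you should check against a live response.

```javascript
// Build the opensearch URL used in this answer.
function buildOpenSearchUrl(search, limit = 1) {
  const params = new URLSearchParams({
    action: "opensearch",
    search,
    limit: String(limit),
    format: "json",
  });
  return `https://en.wikipedia.org/w/api.php?${params}`;
}

// Turn [term, titles, descriptions, urls] into a list of result objects.
function parseOpenSearch(response) {
  const [, titles = [], descriptions = [], urls = []] = response;
  return titles.map((title, i) => ({
    title,
    description: descriptions[i] || "",
    url: urls[i] || "",
  }));
}

// Possible usage (performs a network request, so it is left commented out):
// fetch(buildOpenSearchUrl("led zeppelin"))
//   .then((res) => res.json())
//   .then((json) => console.log(parseOpenSearch(json)[0]));
```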

There isn't a user by the name "Krinkle" here. What answer does it refer to? It is one of *"01AutoMonkey"*, *"AnthonyS"*, and *"Alex"*. Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/37487650/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). — Peter Mortensen, Aug 13 '21 at 06:10

score 1 · Answer 12 · edited Aug 13 '21 at 05:37

If you are just looking for the text, which you can then split up, but don't want to use the API, take a look at en.wikipedia.org/w/index.php?title=Elephant&action=raw.

mr.user1065741 · answered Mar 18 '12 at 18:04 · edited by Peter Mortensen, Aug 13 '21 at 05:37

"ready to be displayed on my websites (so NO BBCODE, or WIKIPEDIA special CODE!)" And this is exactly the opposite. — paulgavrikov, Nov 09 '13 at 21:00

score 0 · Answer 13 · answered Aug 31 '23 at 20:35

There's a simpler way now with Wikimedia Enterprise: the abstract field (https://enterprise.wikimedia.com/docs/data-dictionary/#abstract) in the v2/articles endpoint (https://enterprise.wikimedia.com/docs/on-demand/).

chuck reynolds