Get the first lines of a Wikipedia article

Question

I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter) from the article.

The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly on the print version), but how can I find the first lines displayed? Normally, the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.

For example:

Albert Einstein on Wikipedia (print version). Look in the code. The first real line of text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source; it starts with the same infobox and so on.

So how would you accomplish this task? The programming language is Java, but this shouldn't matter.

One solution that came to mind was to use an XPath query, but such a query would have to be rather complicated to handle all the edge cases.

It wasn't that complicated; see my solution below!

"We thought that instead of populating a information database, the system will just retrieve contents from a public encyclopedia database such as Wikipedia" - http://www.fryan0911.com/2009/05/how-to-retrieve-content-from-wikipedia.html — KMån, Oct 14 '09 at 10:13
KMan: That just retrieves the Wiki source of the article. The problem stated by the OP still applies. — Joey, Oct 14 '09 at 10:22

score 17 · Answer 1 · answered Nov 05 '13 at 04:03

You don't need to.

The API's exintro parameter returns only the first (zeroth) section of the article.

Example: api.php?action=query&prop=extracts&exintro&explaintext&titles=Albert%20Einstein

There are other parameters, too:

exchars: Length of extracts in characters.
exsentences: Number of sentences to return.
exintro: Return only zeroth section.
exsectionformat: What section heading format to use for plaintext extracts:
    wiki: e.g., == Wikitext ==
    plain: no special decoration
    raw: this extension's internal representation
exlimit: Maximum number of extracts to return. Because excerpts generation can be slow, the limit is capped at 20 for intro-only extracts and 1 for whole-page extracts.
explaintext: Return plain-text extracts.
excontinue: When more results are available, use this parameter to continue.

Source: https://www.mediawiki.org/wiki/Extension:MobileFrontend#prop.3Dextracts
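
As a rough illustration of the request above, the following Java sketch assembles such a query URL. The host layout `https://<lang>.wikipedia.org/w/api.php`, the `format=json` parameter, and the class and method names are assumptions added for this example; `prop=extracts`, `exintro`, `explaintext`, `exsentences` and `titles` are the parameters quoted in this answer.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds an extracts query URL from the parameters listed above.
final class ExtractUrlBuilder {

    // Assumed endpoint layout: https://<lang>.wikipedia.org/w/api.php
    static String introUrl(String lang, String title, int sentences) {
        // URLEncoder produces form encoding ('+' for spaces); switch to %20 to match the example.
        String encoded = URLEncoder.encode(title, StandardCharsets.UTF_8).replace("+", "%20");
        StringBuilder sb = new StringBuilder("https://" + lang + ".wikipedia.org/w/api.php");
        sb.append("?action=query&prop=extracts&exintro&explaintext");
        if (sentences > 0) {
            // Optional: limit the extract to a number of sentences.
            sb.append("&exsentences=").append(sentences);
        }
        sb.append("&format=json&titles=").append(encoded);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(introUrl("en", "Albert Einstein", 2));
    }
}
```

The response still has to be fetched and decoded; with `format=json` the plain-text intro is expected under `query.pages.<pageid>.extract`, but check this against the API output you actually receive.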

score 3 · Answer 2 · edited May 13 '23 at 21:08

I was also in the same need and wrote some Python code to do that.

The script downloads the Wikipedia article with a given name, parses it using Beautiful Soup and returns the first few paragraphs.

Code is at wikisnip.py.

answered Oct 15 '09 at 07:06 by Anand Chitipothu · edited May 13 '23 at 21:08 by Peter Mortensen

A wonderfully pragmatic solution, but note that this solution is dependent on how the wiki markup is transformed to HTML. If you can, I'd suggest parsing the wiki markup directly. – gnud Oct 15 '09 at 07:11
I tried. But it turned out very hard because the markup contains function calls of the form `{{...}}`. For example, `{{convert|1.2|km|mi|spell=us}}`. Here is my attempt: http://github.com/anandology/sandbox/blob/master/wikipedia/wikitext.py – Anand Chitipothu Oct 15 '09 at 11:15

score 3 · Answer 3 · edited May 13 '23 at 21:10

Wikipedia offers an Abstracts download. While this is quite a large file (currently 2.5 GB), it offers exactly the information you want, for all articles.

answered Oct 15 '09 at 12:26 by PanMan · edited May 13 '23 at 21:10 by Peter Mortensen

score 1 · Answer 4 · answered Oct 14 '09 at 10:12

You need a parser that can read Wikipedia markup. Try WikiText or the parsers that come with XWiki.

That will allow you to ignore anything you don't want (headlines, tables).

answered Oct 14 '09 at 10:12 by Aaron Digulla

score 1 · Answer 5 · edited May 13 '23 at 21:09

I opened the Albert Einstein article in Firefox, and I clicked on View source. It's pretty easy to parse using an HTML parser. You should focus on the <p> and strip the other HTML content from within it.
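
A minimal sketch of that idea in Java follows. It uses regular expressions as a stand-in for a real HTML parser, which would be more robust; the class and method names are hypothetical. It takes the first `<p>` element in the string, removes the tags nested inside it, and collapses whitespace. HTML entities are left as they are, and a `<p>` inside an infobox table would be picked up if it came first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the first <p> element and strips the tags inside it.
final class FirstParagraphText {

    // DOTALL lets '.' span line breaks inside a paragraph; the body match is non-greedy.
    private static final Pattern FIRST_P =
            Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the plain text of the first paragraph, or null if there is none.
    static String extract(String html) {
        Matcher m = FIRST_P.matcher(html);
        if (!m.find()) {
            return null;
        }
        String inner = m.group(1);
        String noTags = inner.replaceAll("<[^>]+>", "");   // drop <b>, <a href=...>, <sup>, ...
        return noTags.replaceAll("\\s+", " ").trim();      // normalize whitespace
    }
}
```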

answered Oct 15 '09 at 12:17 by Geo · edited May 13 '23 at 21:09 by Peter Mortensen

score 1 · Answer 6 · edited May 13 '23 at 21:37

For example, if you have the result in a string you would find the text:

<div id="bodyContent">

And after that index, you would find the first:

<p>

That would be the index of the first paragraph you mentioned.
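
The two lookups described above could be sketched in Java roughly as follows. The class and method names are hypothetical. The search string omits the closing `>` of the div tag on the assumption that the real tag may carry further attributes, and paragraphs written as `<p class="...">` would be missed by the exact `<p>` match.

```java
// Hypothetical helper illustrating the two index lookups described above.
final class BodyContentParagraph {

    // Returns the inner HTML of the first <p> after the bodyContent div, or null if not found.
    static String firstParagraphHtml(String html) {
        int div = html.indexOf("<div id=\"bodyContent\"");
        if (div < 0) {
            return null;                       // no content div in this document
        }
        int open = html.indexOf("<p>", div);   // first paragraph after the div starts
        if (open < 0) {
            return null;
        }
        int close = html.indexOf("</p>", open);
        if (close < 0) {
            return null;                       // unterminated paragraph
        }
        return html.substring(open + "<p>".length(), close);
    }
}
```

The returned string still contains inline markup such as links and bold tags, so it needs a tag-stripping step before display.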

Try this URL: Link to the content (just works in the browser)

answered Oct 15 '09 at 12:45 by Gabriel Guimarães · edited May 13 '23 at 21:37 by Peter Mortensen

Thanks for the answer, this led me to my solution (selecting the first paragraph of the bodyContent div). – theomega Oct 16 '09 at 17:59

Answer 7 · answered Oct 14 '09 at 10:15 by Joey

Well, when using the Wiki source itself you could just strip out all templates at the start. This might work well enough for most articles that have infoboxes or some messages at the top.

However, some articles might put the starting blurb into a template itself so that would be a little difficult there.
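
A minimal Java sketch of the template-stripping idea, assuming templates are delimited by balanced `{{` and `}}` pairs that may nest: it removes every template at the very start of the wikitext and stops at the first other content. The class and method names are hypothetical, and leading images or files written as `[[File:...]]` as well as triple-brace parameters are not handled.

```java
// Hypothetical helper: removes templates ({{...}}, possibly nested) from the start of wikitext.
final class LeadingTemplateStripper {

    static String strip(String wikitext) {
        int n = wikitext.length();
        int i = 0;
        while (true) {
            // Skip whitespace between consecutive leading templates.
            while (i < n && Character.isWhitespace(wikitext.charAt(i))) {
                i++;
            }
            if (!wikitext.startsWith("{{", i)) {
                break;                          // first non-template content reached
            }
            int depth = 0;
            int j = i;
            while (j < n) {
                if (wikitext.startsWith("{{", j)) {
                    depth++;
                    j += 2;
                } else if (wikitext.startsWith("}}", j)) {
                    depth--;
                    j += 2;
                    if (depth == 0) {
                        break;                  // matching close of the outer template
                    }
                } else {
                    j++;
                }
            }
            if (depth != 0) {
                break;                          // unbalanced braces: leave the rest untouched
            }
            i = j;
        }
        return wikitext.substring(i);
    }
}
```

For input that begins with an infobox containing a nested `{{convert|1.2|km|mi|spell=us}}` call, the whole infobox is dropped and the text after it is returned.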

Another way, perhaps more reliable, would be to take the contents of the first <p> tag that appears directly in the article text (that is, not nested in a table or similar). This should skip infoboxes and other material at the start, as those are probably (I'm not exactly sure) <table>s or <div>s.

Generally, Wikipedia is written for human consumption with only very minimal support for anything semantic. That makes automatic extraction of specific information from the articles pretty painful.

score 0 · Answer 8 · answered Oct 14 '09 at 22:10

As you expect, you will probably end up having to parse the source, the rendered HTML, or both. However, the Wikipedia:Lead_section guideline may give you some indication of what to expect in well-written articles.

answered Oct 14 '09 at 22:10 by Tim

score 0 · Accepted Answer · edited May 13 '23 at 21:39

I worked out the following solution:

Use an XPath query on the XHTML source code (I took the print version because it is shorter, but it also works on the normal version):

//html/body//div[@id='bodyContent']/p[1]

This works on German and on English Wikipedia, and I haven't found an article where it doesn't output the first paragraph. The solution is also quite fast. I also considered taking only the first x characters of the XHTML, but truncating it would leave the XHTML invalid.

If anyone is looking for the Java code, here it is:

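import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WikipediaSummary {

    // Assumption: java.util.logging stands in for the project's logger, whose declaration is not shown.
    private static final Logger logger = Logger.getLogger(WikipediaSummary.class.getName());

    private static final DocumentBuilderFactory dbf;

    static {
        dbf = DocumentBuilderFactory.newInstance();
        // Do not fetch the external DTD referenced by the page's DOCTYPE.
        dbf.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    }

    private static final XPathFactory xpathf = XPathFactory.newInstance();
    private static final String xexpr = "//html/body//div[@id='bodyContent']/p[1]";

    /** Returns the text of the first paragraph of the page at the given URL, or null on failure. */
    private static String getPlainSummary(String url) {
        try {
            // Open the wiki page
            URL u = new URL(url);
            URLConnection uc = u.openConnection();
            uc.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1) Gecko/20090616 Firefox/3.5");

            Document docXML;
            // Close the stream once the document has been parsed
            try (InputStream uio = uc.getInputStream()) {
                DocumentBuilder builder = dbf.newDocumentBuilder();
                docXML = builder.parse(new InputSource(uio));
            }

            // Apply the XPath; evaluate() returns the string value of the first matching node
            XPath xpath = xpathf.newXPath();
            XPathExpression xpathe = xpath.compile(xexpr);
            String s = xpathe.evaluate(docXML);

            // An empty result means no paragraph was found
            return s.isEmpty() ? null : s;
        }
        catch (IOException ioe) {
            logger.log(Level.SEVERE, "Can't get XML", ioe);
            return null;
        }
        catch (ParserConfigurationException pce) {
            logger.log(Level.SEVERE, "Can't get DocumentBuilder", pce);
            return null;
        }
        catch (SAXException se) {
            logger.log(Level.SEVERE, "Can't parse XML", se);
            return null;
        }
        catch (XPathExpressionException xpee) {
            logger.log(Level.SEVERE, "Can't parse XPath", xpee);
            return null;
        }
    }
}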

Use it by calling getPlainSummary("http://de.wikipedia.org/wiki/Uma_Thurman");
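
To try the XPath expression without network access, a small self-contained sketch can run it against an inline document. The sample markup, class name and method name below are made up for illustration; the expression is the one from this answer. The sample omits the DOCTYPE and namespace declaration that a real page would carry.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical offline check of the XPath expression against a hand-written document.
final class XPathSummaryDemo {

    static final String XEXPR = "//html/body//div[@id='bodyContent']/p[1]";

    // Parses well-formed XHTML from a string and returns the text of the first matching <p>.
    static String firstParagraph(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // evaluate() yields the string value of the first node in the result set
            return XPathFactory.newInstance().newXPath().evaluate(XEXPR, doc);
        } catch (Exception e) {
            // Parser, I/O and XPath failures are wrapped to keep the sketch short
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html><body><div id=\"content\"><div id=\"bodyContent\">"
                + "<table><tr><td>Infobox</td></tr></table>"
                + "<p><b>Albert Einstein</b> was a theoretical physicist.</p>"
                + "<p>Second paragraph.</p>"
                + "</div></div></body></html>";
        System.out.println(firstParagraph(sample));
    }
}
```

Because `p[1]` selects only direct children of the bodyContent div, the table standing in for an infobox is skipped even though it comes first.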
