Get Text Content from mediawiki page via API

Question

I'm quite new to MediaWiki, and I have a bit of a problem. I have the title of a wiki page, and I want to get just the text of that page using api.php, but all I have found in the API is a way to obtain the page's wiki content (with wiki markup). I used this HTTP request:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test

But I need only the textual content, without the Wiki markup. Is that possible with the MediaWiki API?

I don't have enough of whatever the microcurrency here is called to add an answer to a question this old, but for anyone searching, it's worth noting that the MediaWiki TextExtracts API ( https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page#Method_3:_Use_the_TextExtracts_API ) gives you just the text contents of an article. (It keeps article headings, but that's relatively easy to regex out.) – sgfit, Jul 05 '20 at 09:52
Not enough microcurrency to edit: Actually, you can also remove the heading markup. Sample query: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Pet_door&formatversion=2&explaintext=true&exsectionformat=plain – sgfit, Jul 06 '20 at 01:37

gilly3 · Answer 1 · 2012-04-11T16:35:09.800

75

Use action=parse to get the html:

/api.php?action=parse&page=test

One way to get the text from the html would be to load it into a browser and walk the nodes, looking only for the text nodes, using JavaScript.
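Outside a browser, the same idea (walk the parsed HTML and keep only the text nodes) can be sketched with Python's standard library. This is only a rough sketch under assumptions: the helper names `build_parse_url` and `html_to_text` are made up for illustration, the `api_url` argument stands for whatever your wiki's api.php address is, and `prop=text` with `format=json` is assumed to be accepted by the target wiki.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_parse_url(api_url, page):
    """Build an action=parse request URL (api_url is your wiki's api.php)."""
    params = {"action": "parse", "page": page, "prop": "text", "format": "json"}
    return api_url + "?" + urlencode(params)


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text nodes of an HTML fragment with whitespace normalized."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.chunks).split())
```

With the legacy JSON format, the rendered HTML should sit under `parse` → `text` → `*` in the decoded response, so something like `html_to_text(data["parse"]["text"]["*"])` would be the typical call once the response has been fetched and decoded.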

edited Apr 11 '12 at 16:35

answered May 27 '11 at 16:50

gilly3


10

`action=parse` can also return JSON by adding `format=json`. – scai Nov 20 '16 at 09:43
Getting links to the page in results for titles search would be nice. Not sure which query string that is. Also, Hi @gilly3.. :D This answer still helped after a decade. – Mahesh Jul 27 '21 at 15:34
using the REST API is also an option, for getting a parsed html version of a MediaWiki page `/rest.php/v1/page//html` working example: https://www.mediawiki.org/w/rest.php/v1/page/MediaWiki/html – Robis Koopmans Oct 06 '21 at 18:33

eric.mitchell · Answer 2 · 2016-12-27T19:38:57.527

47

The TextExtracts extension of the API does about what you're asking. Use prop=extracts to get a cleaned up response. For example, this link will give you cleaned up text for the Stack Overflow article. What's also nice is that it still includes section tags, so you can identify individual sections of the article.

Just to include a visible link in my answer, the above link looks like:

/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true

Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.
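As a hedged illustration of the prop=extracts request, here is a small Python sketch. It deviates from the XML link above in two ways that should be read as assumptions rather than part of this answer: it asks for `format=json` with `formatversion=2`, and it adds `explaintext` so the extract comes back as plain text instead of limited HTML. The function names `build_extract_url` and `extract_text` are hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(api_url, title):
    """Build a TextExtracts query URL for a single title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "redirects": "true",
        "explaintext": "true",   # assumption: plain text is wanted
        "format": "json",
        "formatversion": "2",    # assumption: pages come back as a list
    }
    return api_url + "?" + urlencode(params)


def extract_text(response):
    """Map page title -> extract for a decoded formatversion=2 response."""
    pages = response.get("query", {}).get("pages", [])
    return {page["title"]: page.get("extract", "") for page in pages}
```

If the wiki does not have the TextExtracts extension installed, the `extract` key will simply be missing, which is why the sketch falls back to an empty string.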

edited Dec 27 '16 at 19:38

answered Feb 18 '14 at 04:05

eric.mitchell


7

TextExtracts is an extension to MediaWiki. It's available for Wikipedia but not for every MediaWiki installation. https://www.mediawiki.org/wiki/Extension:TextExtracts – Amr Sep 17 '14 at 04:36

score 40 · Answer 3 · edited Nov 20 '16 at 09:48

40

Appending ?action=raw to the URL of a MediaWiki page returns the latest content as raw text (that is, the unrendered wikitext). E.g.: https://en.wikipedia.org/wiki/Main_Page?action=raw
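When building such a URL in code, the title should be percent-encoded but the `?action=raw` query string must not be, otherwise the wiki treats it as part of the page name. A minimal Python sketch, assuming the wiki serves articles under a prefix such as `https://en.wikipedia.org/wiki/` (other installations may use a different article path) and using a made-up helper name `raw_url`:

```python
from urllib.parse import quote, urlencode


def raw_url(wiki_base, title):
    """Return the ?action=raw URL for a page title.

    wiki_base is the article path prefix, ending in a slash.
    """
    # Spaces become underscores, as in normal MediaWiki page URLs.
    path = quote(title.replace(" ", "_"))
    # The query string is appended unencoded so '?' and '=' keep their meaning.
    return wiki_base + path + "?" + urlencode({"action": "raw"})
```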

edited Nov 20 '16 at 09:48

scai


answered Mar 06 '14 at 12:49

baijum


1

I tried this on a page not on wikipedia, and it didn't work. Does this require an extension? – Tim Bird Jun 30 '15 at 17:51
It seems only to work for the English Wikipedia - see [example](https://de.wikipedia.org/wiki/Eurofighter_Typhoon%26action%3Draw) – Martin Thoma Sep 27 '15 at 11:56
1

@MartinThoma If you change `%26action%3Draw` to `?action=raw`, it works. – KST May 10 '16 at 00:57
Is there any way to also get page title in the same request using this method? – Sep 26 '17 at 03:44

score 33 · Answer 4 · edited Jun 11 '15 at 02:16

33

You can get the wiki data in text format from the API by using the explaintext parameter. Plus, if you need to access many titles' information, you can get all the titles' wiki data in a single call. Use the pipe character | to separate each title. For example, this API call will return the data from both the "Google" and "Yahoo" pages:

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=

Parameters:

explaintext: Return extracts as plain text instead of limited HTML.
exlimit=max: Return more than one result. The max is currently 20.
exintro: Return only the content before the first section. If you want the full data, just remove this.
redirects=: Resolve redirect issues.
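
The multi-title call above could be assembled and unpacked along these lines in Python. This is a sketch, not a definitive client: `build_intro_url` and `intros_by_title` are hypothetical names, and the response shape assumed is the default JSON format, where `query.pages` is an object keyed by page ID.

```python
from urllib.parse import urlencode


def build_intro_url(api_url, titles):
    """Build one extracts query covering several titles."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": "",             # flag parameters are sent with empty values
        "exintro": "",                 # drop this key to request the full text
        "titles": "|".join(titles),    # pipe-separated list of titles
        "redirects": "",
    }
    return api_url + "?" + urlencode(params)


def intros_by_title(response):
    """Map title -> extract for a decoded default-format JSON response."""
    pages = response.get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}
```

Note that `urlencode` percent-encodes the pipe as `%7C`, which the API accepts in the same way as a literal `|`.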

edited Jun 11 '15 at 02:16

Dan Getz


answered Jun 10 '15 at 18:31

Anuraj


3

This is just perfect. Thanks – lnaia Feb 08 '16 at 23:26
This will give you just the first section, not the whole article's text – Jonathan Morales Vélez Jan 11 '18 at 19:51
We can also use exsectionformat=plain to remove wikitext-style formatting (== like this ==). Source: https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bextracts – Eneas Gesing Aug 10 '20 at 23:08
Can you get the data of a page by the id of this page? – Oleg Yablokov Nov 02 '21 at 10:44

score 11 · Answer 5 · answered Apr 24 '12 at 18:41

11

That's the simplest way: http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content

answered Apr 24 '12 at 18:41

Hardest


4

Unfortunately, this returns MediaWiki markup, which needs to be parsed in order to retrieve the text. – lightyrs Jul 04 '13 at 07:11

score 7 · Answer 6 · answered Aug 03 '17 at 06:52

7

Python users coming to this question might be interested in the wikipedia module (docs):

import wikipedia
wikipedia.set_lang('de')
page = wikipedia.page('Wikipedia')
print(page.content)

All formatting except for section headings (==) is stripped away.
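
If the remaining section headings are unwanted too, they can be filtered out line by line. The following is a small sketch with a made-up helper name `drop_headings`; it assumes headings appear on their own line in the form `== Title ==` (with two or more equals signs on each side), which is how plain-text extracts usually present them.

```python
import re

# A heading line: leading '=' run, some title text, trailing '=' run.
HEADING_LINE = re.compile(r"^=+[^=\n].*?=+$")


def drop_headings(text):
    """Remove lines that consist solely of a '== Heading ==' marker."""
    kept = [
        line for line in text.splitlines()
        if not HEADING_LINE.match(line.strip())
    ]
    return "\n".join(kept)
```

It could then be applied as `drop_headings(page.content)`.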

answered Aug 03 '17 at 06:52

Martin Thoma


Eric Normand · Accepted Answer · 2009-10-27T13:51:10.357

6

I don't think it is possible using the API to get just the text.

What has worked for me was to request the HTML page (using the normal URL that you would use in a browser) and strip out the HTML tags under the content div.

EDIT:

I have had good results using HTML Parser for Java. It has examples of how to strip out HTML tags under a given DIV.
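
For comparison, the same "strip the tags under the content div" approach can be sketched in Python with the standard library instead of HTML Parser for Java. Treat this as an illustration under assumptions: the class and function names are invented here, the default div id `mw-content-text` is assumed to be the content container of the wiki's skin, and the depth counting relies on the page HTML having properly closed tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DivTextExtractor(HTMLParser):
    """Collects text nodes found inside the <div> with the given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.div_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, div_id="mw-content-text"):
    """Return the whitespace-normalized text under the chosen div."""
    parser = DivTextExtractor(div_id)
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps block elements apart, at the cost of
    # occasionally splitting a word that spans an inline tag.
    return " ".join(" ".join(parser.chunks).split())
```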

edited Oct 27 '09 at 13:51

answered Oct 26 '09 at 14:51

Eric Normand


I have done the same thing. I have a Java app that must receive the text content of a wiki page. When I use the API and receive the wikisyntax page, it works very fast, but I need clean text. I have tried requesting the HTML page and stripping out the HTML tags, but that works slowly, which is why I asked about this feature in the wiki API. Or maybe you know a good wikisyntax-to-plain-text converter for Java, so that I can convert it directly in Java? – Le_Coeur Oct 26 '09 at 15:04
2

The real issue with wikipedia's language is that it is Turing complete. If you look closely at the code of a page, you will notice all sorts of custom functions. The definitions of those functions have to be fetched as well and then interpreted, which might expand to yet more functions. That is why I reverted to html parsing, which contains the complete, rendered text. – Eric Normand Oct 27 '09 at 13:47
2

MediaWiki's wikitext isn't quite Turing complete since the developers have bravely fought off the editors' demands for looping constructs. But you are correct that to get plain text out of MediaWiki you need to get the HTML and then strip that. You might like to use this `html2txt.pl` tool I made in Perl for that job, or convert it to your favourite language: https://gist.github.com/751910 – hippietrail May 06 '11 at 01:14
A relatively new extension to the API (TextExtracts) now allows for plain text extraction from an article. See my answer. – eric.mitchell Apr 09 '14 at 00:57

score 4 · Answer 8 · answered Dec 27 '17 at 23:15

4

Use action=render to get the cleanest possible page:

https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I?action=render

vs

https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I

answered Dec 27 '17 at 23:15

Yaza


score -2 · Answer 9 · edited Jun 23 '17 at 19:41

-2

Once the contents have been fetched into your page, you can use the PHP function strip_tags() to remove the HTML tags.

edited Jun 23 '17 at 19:41

Greenonline


answered Jun 23 '17 at 14:50

user8205791

