I just want to get the content (no links, no categories, no images... just text).
1 Answer
There is no way to get "just the text" from the Wikipedia API. You can either download the HTML of the page (if you do this via index.php rather than api.php, use `action=render` to avoid downloading all the skin content) or the wikitext (which you can do via the API or by passing `action=raw` to index.php); you will then have to parse it yourself to remove the bits you don't want to keep.
In the HTML output, MediaWiki is generally good about adding classes to various interface elements you might want to filter out; the templates and such created by users are perhaps less so (e.g. the hack for table sorting just puts some text in a `display:none` span, with no class).
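As a rough illustration of that kind of filtering, here is a minimal sketch using only Python's standard `html.parser`. The names `TextExtractor` and `visible_text` are made up for this example, and the default `skip_classes=("metadata",)` is just one plausible choice; real pages will likely need a longer list. It assumes reasonably well-formed HTML (every non-void element it skips has a closing tag).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Collect text, skipping unwanted subtrees (hypothetical helper)."""

    def __init__(self, skip_classes=("metadata",)):
        super().__init__(convert_charrefs=True)
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.parts = []

    def _unwanted(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return (tag in ("script", "style")
                or bool(self.skip_classes.intersection(classes))
                or "display:none" in style)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a skipped subtree, every nested element deepens it.
        if self.skip_depth or self._unwanted(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(html, skip_classes=("metadata",)):
    """Return the text of `html` with skipped subtrees removed."""
    parser = TextExtractor(skip_classes)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

For example, `visible_text('<p>Hello <span style="display:none">123</span>world</p>')` drops the hidden sort key and keeps the surrounding text. A lenient third-party HTML parser would cope better with unbalanced markup than this depth counter does.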
To get the wikitext via the API, use `prop=revisions`. To get the rendered HTML, use `action=parse`.
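To make the four options concrete, here is a small sketch that only builds the request URLs. The endpoint URLs assume English Wikipedia, and the function names are invented for this example. The extra parameters (`rvprop=content`, `prop=text`, `format=json`) are ones I believe the API accepts, but check the API help for your wiki; newer MediaWiki versions may also want `rvslots=main` alongside `rvprop=content`.

```python
from urllib.parse import urlencode

# Assumed endpoints (English Wikipedia); other wikis use their own host.
API_URL = "https://en.wikipedia.org/w/api.php"
INDEX_URL = "https://en.wikipedia.org/w/index.php"


def wikitext_api_url(title):
    """Wikitext via api.php using prop=revisions."""
    params = {"action": "query", "prop": "revisions", "rvprop": "content",
              "titles": title, "format": "json"}
    return API_URL + "?" + urlencode(params)


def html_api_url(title):
    """Rendered HTML via api.php using action=parse."""
    params = {"action": "parse", "page": title, "prop": "text",
              "format": "json"}
    return API_URL + "?" + urlencode(params)


def wikitext_index_url(title):
    """Wikitext via index.php using action=raw."""
    return INDEX_URL + "?" + urlencode({"title": title, "action": "raw"})


def html_index_url(title):
    """Rendered HTML without the skin via index.php using action=render."""
    return INDEX_URL + "?" + urlencode({"title": title, "action": "render"})
```

Any of these URLs can then be fetched with `urllib.request.urlopen` or a similar HTTP client; whichever you pick, the result still needs the filtering described above.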

- OK, but then I also get the Wikipedia notice at the top of the page, like "This article needs additional citations for verification." How do I get "just the text"? Isn't there a third-party library or API service? – Leonardo May 08 '11 at 12:11
- @Leonardo: There is no API service; I don't know of any third-party library. In that particular case, you can strip out the template `{{refimprove}}` from the wikitext, or you can strip anything with class `metadata` from the HTML source. – Anomie May 08 '11 at 12:27
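For the wikitext route, stripping a single template could look roughly like the sketch below. `strip_template` is a hypothetical helper, not part of any library, and the regular expression only handles a template whose parameters contain no nested `{{...}}`; that is usually fine for simple maintenance tags but not for templates in general.

```python
import re


def strip_template(wikitext, name="refimprove"):
    """Remove every {{name}} or {{name|...}} call from the wikitext.

    Assumes the template's parameters contain no nested braces.
    """
    pattern = re.compile(
        r"\{\{\s*" + re.escape(name) + r"\s*(?:\|[^{}]*)?\}\}\n?",
        re.IGNORECASE,
    )
    return pattern.sub("", wikitext)
```

For instance, `strip_template("{{Refimprove|date=May 2011}}\n'''Foo''' is a bar.")` leaves only the article sentence. Templates with nested calls need a real wikitext parser or a brace-counting scan instead of a regex.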