How can I extract all sections of a Wikipedia page in plain text?

Question

I have the following code in python, which extracts only the introduction of the article on "Artificial Intelligence", while instead I would like to extract all sub-sections (History, Goals ...)

import requests

def get_wikipedia_page(page_title):
  endpoint = "https://en.wikipedia.org/w/api.php"
  params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": "",
    "explaintext": "",
    "titles": page_title
  }
  response = requests.get(endpoint, params=params)
  data = response.json()
  pages = data["query"]["pages"]
  page_id = list(pages.keys())[0]
  return pages[page_id]["extract"]

page_title = "Artificial intelligence"
wikipedia_page = get_wikipedia_page(page_title)

Someone proposed to use another approach that parses html and uses BeautifulSoup to convert to text:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in 
line.split("  
"))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

This is not a good-enough solution, as it includes all text that appears on the website (like image text), and it includes citations in the text (e.g. [1]), while the first script removes them.

I suspect that the api of wikipedia should offer a more elegant solution, it would be rather weird if one can get only the first section?

hc_dev · Accepted Answer · 2022-12-18T18:37:29.277

Retrieving Wikipedia pages as HTML

Like in our web-browsers we can retrieve the complete Wikipedia page by URL and parse the HTML response with Beautiful Soup.

Wikepedia's API

As alternative we can use the API, see Wikipedia's API documentation.

Extract plain-text

When using the action=query with format=json you can use these 4 options for text-extraction:

titles=Artificial intelligence for the page
prop=extracts use the TextExtracts extension
exintro limits the response to content before the first section heading (remove this to get entire text incl. sections)
explaintext extracts as plain-text response instead of HTML

Example: https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Artificial%20intelligence&prop=extracts&explaintext

Get each section separately

To retrieve the sections use the action=parse with format=json and those options:

page=Artificial intelligence get contents for this page
prop=sections only return sections

There is also an API sandbox where you can try several parameters. The resulting GET request will retrieve all the sections of example page "Artificial intelligence": https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Artificial%20intelligence&prop=sections&formatversion=2

This will respond with a JSON containing all sections:

{
    "parse": {
        "title": "Artificial intelligence",
        "pageid": 1164,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "History",
                "number": "1",
                "index": "1",
                "fromtitle": "Artificial_intelligence",
                "byteoffset": 5987,
                "anchor": "History",
                "linkAnchor": "History"
            }
}

(simplified, only first section kept)

To get the text of one of those sections, specify the section as query-parameter (by id or title), e.g. section=1&sectiontitle=History: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Artificial_intelligence&section=1&sectiontitle=History&formatversion=2

This retrieves the text (in HTML format):

{
    "parse": {
        "title": "Artificial intelligence",
        "pageid": 1164,
        "revid": 1126677096,
        "text": "<div class=\"mw-parser-output\"><h2><span class=\"mw-headline\" id=\"History\">History</span><span class=\"mw-editsection\"><span class=\"mw-editsection-bracket\">[</span><a href=\"/w/index.php?title=Artificial_intelligence&amp;action=edit&amp;section=1\" title=\"Edit section: History\">edit</a><span class=\"mw-editsection-bracket\">]</span></span></h2>\n<style data-mw-deduplicate=\"TemplateStyles:r1033289096\">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div role=\"note\" class=\"hatnote navigation-not-searchable\">Main articles: <a href=\"/wiki/History_of_artificial_intelligence\" title=\"History of artificial intelligence\">History of artificial intelligence</a> and <a href=\"/wiki/Timeline_of_artificial_intelligence\" title=\"Timeline of artificial intelligence\">Timeline of artificial intelligence</a>

Note: above response was cut-off to only show a sample of the text.

Although above text contents is formatted as HTML, there might be options to get it as plain-text.

Python code

You can also use Python like

package wikipedia:

import wikpedia

wikipedia.set_lang('en')
page = wikipedia.page('Artificial intelligence')
print(page.content)

a Gist from Sai Kumar Yava (scionoftech) using requests: A small Python Code to get Wikipedia page content in plan text

thank you for all the info! the solution with the wikipedia package worked like a charm. also removing the "exintro" argument in my script did the job, as this asks the api to return only text before the first section — blindeyes, Dec 18 '22 at 18:13

How can I extract all sections of a Wikipedia page in plain text?

1 Answers1

Retrieving Wikipedia pages as HTML

Wikepedia's API

Extract plain-text

Get each section separately

See also

Python code

Linked