I'm looking to scrape (ideally all) pages from SimpleWiki (and, generally, any Wikimedia site if possible) to get each page's summary, i.e. the first few paragraphs of the lead section, before the body of the article starts.
I then want to wrap each page into a dictionary of the form:
```json
{
    "title": "Some Wiki page title",
    "source": "Some Wiki link",
    "summary": "Some Wiki summary..."
}
```
and then json.dump them.
As an example, I'd like to be able to take a page such as https://simple.wikipedia.org/wiki/A and get it into the following form:
```json
{
    "title": "A",
    "source": "https://simple.wikipedia.org/wiki/A",
    "summary": "A or a is the first letter of the English alphabet. ... . A capital a is written \"A\". Use a capital a at the start of a sentence if writing"
}
```
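For a single page, I can get most of the way there with Wikimedia's REST summary endpoint (a rough sketch, assuming the requests library; the field names are taken from the JSON that endpoint returned for this page):

```python
import json

import requests

# The REST API's summary endpoint returns the lead section of one page.
resp = requests.get("https://simple.wikipedia.org/api/rest_v1/page/summary/A")
resp.raise_for_status()
data = resp.json()

record = {
    "title": data["title"],
    "source": data["content_urls"]["desktop"]["page"],  # canonical page URL
    "summary": data["extract"],  # plain-text lead section
}

with open("a.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)
```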
I was just wondering whether there's an easy way to do this for every page -- I've searched around (e.g. the Wikimedia dumps) but haven't found anything that gives me just the summaries yet.
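The closest I've gotten for all pages at once is the action API's TextExtracts prop combined with the allpages generator. Here's a sketch of what I mean (parameter names are from the API docs; I haven't run it over the whole wiki, so treat it as untested at scale):

```python
import json

import requests

API = "https://simple.wikipedia.org/w/api.php"  # any Wikimedia wiki's action API
HEADERS = {"User-Agent": "summary-scraper/0.1 (example)"}  # Wikimedia asks for a descriptive UA


def iter_summaries():
    """Yield {title, source, summary} dicts for every article, following continuation."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "allpages",
        "gaplimit": "20",     # TextExtracts returns at most 20 extracts per response
        "gapnamespace": "0",  # main/article namespace only
        "prop": "extracts|info",
        "exintro": 1,         # only the lead section (before the first heading)
        "explaintext": 1,     # plain text instead of HTML
        "inprop": "url",      # adds "fullurl" for the "source" field
    }
    while True:
        data = requests.get(API, params=params, headers=HEADERS).json()
        for page in data.get("query", {}).get("pages", {}).values():
            if "extract" in page:  # extracts for a batch may arrive across responses
                yield {
                    "title": page["title"],
                    "source": page["fullurl"],
                    "summary": page["extract"],
                }
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the last batch stopped


with open("summaries.json", "w", encoding="utf-8") as f:
    json.dump(list(iter_summaries()), f, ensure_ascii=False, indent=2)
```

This still feels clunky (extracts come back at most 20 at a time), so if there's a ready-made dump or tool that gives just the lead sections, I'd rather use that.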