
I'm looking to scrape (ideally all) pages from SimpleWiki (and, more generally, any Wikimedia site if possible) to get each page's summary (the first few paragraphs that appear before the main body of text).

I then want to wrap each page into a dictionary of the form:

{
  "title": "Some Wiki title page",
  "source": "Some Wiki link",
  "summary": "Some Wiki summary..."
}

and then json.dump them.
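To make the output step concrete, here is a rough sketch; the pages list is just placeholder data standing in for whatever the scraping part ends up producing:

import json

# Hypothetical placeholder data: (title, url, intro text) triples,
# standing in for whatever the actual scraping produces
pages = [
    ('A', 'https://simple.wikipedia.org/wiki/A',
     'A or a is the first letter of the English alphabet. ...'),
]

# Wrap each page into the {"title", "source", "summary"} shape
records = [
    {'title': title, 'source': url, 'summary': summary}
    for title, url, summary in pages
]

with open('summaries.json', 'w', encoding='utf-8') as fh:
    json.dump(records, fh, ensure_ascii=False, indent=2)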

As an example, I'd like to be able to take an arbitrary page such as https://simple.wikipedia.org/wiki/A and end up with the following:

{
  "title": "A",
  "source": "https://simple.wikipedia.org/wiki/A",
  "summary": "A or a is the first letter of the English alphabet. ... . A capital a is written "A". Use a capital a at the start of a sentence if writing"
}

I was just wondering whether there's an easy way to do this -- I've searched around (e.g. Wikimedia dumps) but haven't found anything yet.

Mark Land

1 Answer


The thing you are looking for should be Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot/Installation#Install_Pywikibot. During installation you can choose which family (Wikidata, Wikipedia, MediaWiki, etc.) you want to work with.
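Roughly, usage could look something like the sketch below. This is untested: it assumes Pywikibot is installed with a user-config.py generated for the "wikipedia" family and "simple" language code, and that your Pywikibot release provides Page.extract() (newer versions do); the total=25 cap and the output filename are just for illustration.

import json
import pywikibot

site = pywikibot.Site('simple', 'wikipedia')  # simple.wikipedia.org

records = []
# allpages() walks articles in the main namespace; total=25 caps it for testing
for page in site.allpages(namespace=0, total=25):
    records.append({
        'title': page.title(),
        'source': page.full_url(),
        # intro=True restricts the plain-text extract to the lead section
        'summary': page.extract(intro=True),
    })

with open('simplewiki_summaries.json', 'w', encoding='utf-8') as fh:
    json.dump(records, fh, ensure_ascii=False, indent=2)

Under the hood this goes through the same MediaWiki API (prop=extracts) for any Wikimedia site, so the same approach works beyond SimpleWiki by changing the family/code passed to pywikibot.Site().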

Leemosh