
I'm looking to scrape (ideally all) pages from SimpleWiki (and, more generally, any Wikimedia site if possible) to get each page's summary (the first few paragraphs that appear before the main body of text).

I then want to wrap each page into a dictionary of the form:

{
  "title": "Some Wiki title page",
  "source": "Some Wiki link",
  "summary": "Some Wiki summary..."
}

and then json.dump them.
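To make the output step concrete, here is a rough sketch; the pages list is just placeholder data standing in for whatever the scraping part ends up producing:

import json

# Hypothetical placeholder data: (title, url, intro text) triples,
# standing in for whatever the actual scraping produces
pages = [
    ('A', 'https://simple.wikipedia.org/wiki/A',
     'A or a is the first letter of the English alphabet. ...'),
]

# Wrap each page into the {"title", "source", "summary"} shape
records = [
    {'title': title, 'source': url, 'summary': summary}
    for title, url, summary in pages
]

with open('summaries.json', 'w', encoding='utf-8') as fh:
    json.dump(records, fh, ensure_ascii=False, indent=2)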

As an example, I'd like to be able to take an arbitrary page such as https://simple.wikipedia.org/wiki/A and end up with the following:

{
  "title": "A",
  "source": "https://simple.wikipedia.org/wiki/A",
  "summary": "A or a is the first letter of the English alphabet. ... . A capital a is written "A". Use a capital a at the start of a sentence if writing"
}

I was just wondering whether there's an easy way to do this -- I've searched around (e.g. Wikimedia dumps) but haven't found anything yet.

Mark Land

1 Answer


The thing you are looking for should be Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot/Installation#Install_Pywikibot. During installation you can choose which family (Wikidata, Wikipedia, MediaWiki, etc.) you want to work with.
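Roughly, usage could look something like the sketch below. This is untested: it assumes Pywikibot is installed with a user-config.py generated for the "wikipedia" family and "simple" language code, and that your Pywikibot release provides Page.extract() (newer versions do); the total=25 cap and the output filename are just for illustration.

import json
import pywikibot

site = pywikibot.Site('simple', 'wikipedia')  # simple.wikipedia.org

records = []
# allpages() walks articles in the main namespace; total=25 caps it for testing
for page in site.allpages(namespace=0, total=25):
    records.append({
        'title': page.title(),
        'source': page.full_url(),
        # intro=True restricts the plain-text extract to the lead section
        'summary': page.extract(intro=True),
    })

with open('simplewiki_summaries.json', 'w', encoding='utf-8') as fh:
    json.dump(records, fh, ensure_ascii=False, indent=2)

Under the hood this goes through the same MediaWiki API (prop=extracts) for any Wikimedia site, so the same approach works beyond SimpleWiki by changing the family/code passed to pywikibot.Site().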

Leemosh