How to extract plain text from mediawiki?

Question

I have exported some categories from https://awoiaf.westeros.org/index.php/Special:Export. ~~They come in XML format.~~ I would like plain text from the "Synopsis" sections. You can download the whole thing here (54KB compressed).

A typical Synopsis section looks like:

==Synopsis==
[[Catelyn Tully|Catelyn]] listens to the continuous pounding noise of the drums the musicians in the hall are playing. She is seated between [[Ryman Frey]] and [[Roose Bolton]] during the wedding feast. She remarks to herself how joyless the wedding is, and watches as [[Robb Stark|Robb]] dances with several of the Frey maids and [[Edmure Tully|Edmure]] dotes on his soon to be wife, [[Roslin Frey|Roslin]]. Catelyn becomes more wary when she learns that [[Olyvar Frey|Olyvar]], [[Perwyn Frey|Perwyn]], and [[Alesander Frey]] are all not in attendance at the wedding. She notices [[Merrett Frey]] trying to drink the [[Greatjon Umber|Greatjon]] under the table, and finally Lord [[Walder Frey]] calls for the bedding. Robb does not participate as the Greatjon carries a weeping Roslin to the bed chamber.

How can I extract the plain text from all the Synopsis sections?

This isn't XML ... Looks like RST, which really is plaintext with some links thrown in — OneCricketeer, Jul 25 '20 at 08:20
@OneCricketeer Interesting. The linked page says "You can export the text and editing history of a particular page or set of pages wrapped in some XML." . I don't know much about XML myself. I just want the plaintext. — Simd, Jul 25 '20 at 08:24
Plenty of examples if you just search what it looks like... What you pasted is already plaintext, so I'm not understanding the question — OneCricketeer, Jul 25 '20 at 08:25
@OneCricketeer I would like the plain text (so [[Catelyn Tully|Catelyn]] becomes Catelyn Tully for example) from the Synopsis sections. There are too many to copy and edit by hand. — Simd, Jul 25 '20 at 08:27
Use regex for that to find patterns of `[[text|link]]` and replace with just `text`... Not sure why you're calling that "XML" when you say you don't know XML — OneCricketeer, Jul 25 '20 at 08:31
Not in a place where I can properly examine the download link, but it looks like the ReST is wrapped in some XML. To do this properly, you'd have to first unwrap the XML scaffolding (probably something fairly simple like `Your ReST content here`) and then apply a ReST parser to the extracted data ... But if your needs are simple, maybe just write your own regex to pull out the Synopsis sections and remove square bracket links. — tripleee, Jul 25 '20 at 08:51
This is not a duplicate of https://stackoverflow.com/q/12883428/407651. I think the question is a bit lazy (shows no real research effort), but the markup in the question is not reStructuredText. It is MediaWiki markup. The linked file ("the whole thing") is indeed an XML file. — mzjn, Jul 25 '20 at 08:53
https://stackoverflow.com/questions/11279589/ doesn't contain any good answer. — Shiplu Mokaddim, Jul 26 '20 at 09:44

Shiplu Mokaddim · Accepted Answer · 2020-07-25T08:37:26.060

First, you need to parse the export as XML. I recommend using lxml and XPath.

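from lxml import etree

# Parse the XML file downloaded from Special:Export.
tree = etree.parse('file.xml')
# The export declares a default namespace, so each step of the XPath needs a prefix.
expression = '/m:mediawiki/m:page/m:revision/m:text/text()'
# This URI must match the xmlns attribute on the file's root <mediawiki> element.
namespaces = {"m": "http://www.mediawiki.org/xml/export-0.10/"}
# Result: a list with one string of wiki markup per page revision.
texts = tree.xpath(expression, namespaces=namespaces)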

Once you have the text of every page, use a regular expression on each one to pull out the Synopsis section and strip the link markup. Or write your own parser.
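The regular-expression step could look roughly like the sketch below. It is not a full MediaWiki parser: it assumes the section heading is written exactly as `==Synopsis==` (level 2), that the section runs until the next level-2 heading or the end of the page, and that the only markup to remove is `[[Target]]` and `[[Target|Label]]` links. The names `extract_synopsis`, `strip_links`, and `use_label` are made up for this example.

```python
import re

# Body of a "==Synopsis==" section: everything after the heading up to the
# next level-2 heading ("==Something==") or the end of the page text.
SYNOPSIS_RE = re.compile(
    r"^==[ \t]*Synopsis[ \t]*==[ \t]*\n?(.*?)(?=^==[^=]|\Z)",
    re.DOTALL | re.MULTILINE,
)

# Wiki links of the form [[Target]] or [[Target|Label]].
LINK_RE = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def extract_synopsis(wikitext):
    """Return the raw markup of the Synopsis section, or None if there is none."""
    match = SYNOPSIS_RE.search(wikitext)
    return match.group(1).strip() if match else None


def strip_links(wikitext, use_label=True):
    """Replace each wiki link with plain text.

    use_label=True keeps the displayed label ("Catelyn"), as on the rendered
    page; use_label=False keeps the link target ("Catelyn Tully").
    """
    def replace(match):
        target, label = match.group(1), match.group(2)
        if use_label and label:
            return label
        return target

    return LINK_RE.sub(replace, wikitext)


if __name__ == "__main__":
    sample = (
        "Intro text.\n"
        "==Synopsis==\n"
        "[[Catelyn Tully|Catelyn]] sits between [[Ryman Frey]] and [[Roose Bolton]].\n"
        "==References==\n"
        "<references/>\n"
    )
    print(strip_links(extract_synopsis(sample)))
```

With the `texts` list from the XPath query above, something like `[strip_links(s) for s in map(extract_synopsis, texts) if s]` would collect the cleaned sections. Passing `use_label=False` gives the "Catelyn Tully" form asked for in the question's comments. Other markup (bold/italic quotes, templates, `<ref>` tags) is left untouched by this sketch; a dedicated wikitext parser such as mwparserfromhell may be a better fit if the pages use much of it.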

edited Jul 25 '20 at 08:37

answered Jul 25 '20 at 08:31

Shiplu Mokaddim


If the input really was XML, find an existing question with a similar answer, and close as duplicate. At 50k rep you should no longer need to answer questions just for the rep. But in addition, the input very clearly isn't actually XML. – tripleee Jul 25 '20 at 08:41
@tripleee Are you sure the full file I linked isn't XML? – Simd Jul 25 '20 at 09:04
