I'm trying to use a regex to insert a template into a page, before all category or interwiki links, but after everything else. So if you have a page that ends like this:

== See Also ==
* [[Link one]]
* [[more link]]
* [//external.link external link]

[[Category:Pages]]
[[de:Spezial Page]]

I want the template {{template}} to be inserted before the [[Category:Pages]] but after everything else.

Note: The last section is not necessarily a list - it could be

== References ==
<references/>

or even something else. The point is to insert it before all category/interwiki links at the end, but after the last section.

What regex can help me do this? I've tried (?P<pre>[\s\S]+)(?P<cats>(?:\[\[[^]]:[^]]\]\])*$) as the matching expression with \g<pre>{{template}}\n\g<cats> as the substituting expression, but that simply inserts it at the very end.

Regex flavor: Python 2
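
A minimal Python sketch of why the attempt above appears to land at the very end, assuming a shortened sample page (the `page` string is made up for illustration). The greedy `[\s\S]+` consumes the whole text, so the `cats` group can only match the empty string before `$`. Note also that `[^]]:[^]]` allows just one character on each side of the colon, so a link like `[[Category:Pages]]` would not match it in any case.

```python
import re

# The expression from the question, unchanged.
pattern = re.compile(r'(?P<pre>[\s\S]+)(?P<cats>(?:\[\[[^]]:[^]]\]\])*$)')

# Hypothetical, shortened page used only for this illustration.
page = 'Some text\n\n[[Category:Pages]]\n[[de:Spezial Page]]\n'

match = pattern.match(page)
pre, cats = match.group('pre'), match.group('cats')

# pre holds the entire page and cats is empty,
# so the substitution appends the template at the very end.
result = pattern.sub(r'\g<pre>{{template}}\n\g<cats>', page)
print(repr(cats))
print(result)
```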

AbyxDev
  • https://stackoverflow.com/a/1732454/1394393 Use a real parser. – jpmc26 Nov 12 '17 at 11:17
  • @jpmc26 any "real parser"s I can use to this end? – AbyxDev Nov 12 '17 at 11:27
  • Dunno. I've never used Media Wiki. (But I do know when you're using a text processing engine that isn't really powerful enough for the language you're working with.) [Google](https://www.google.com/search?q=python+mediawiki+parser) turns up a few results that look promising. – jpmc26 Nov 12 '17 at 11:30

3 Answers

Alright, combining jpmc26's comment and mmm's answer, I figured it out:

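import re
import mwparserfromhell as mw

# get content of page (not shown); content holds the page's wikitext
wikicode = mw.parse(content)
# keep only category links and two-letter interwiki links
links = [link for link in wikicode.filter_wikilinks()
         if re.match(r'\[\[(Category:|[a-z][a-z]:).*\]\]', str(link))]
# insert the template before the first such link
wikicode.insert_before(links[0], '{{template}}')
content = str(wikicode)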

Sorry for taking your time!

AbyxDev
  • Note that many interwiki languages have three-letter codes, and some have fairly irregular names like `be-x-old` or `zh-min-nan` (see full list [here](https://phabricator.wikimedia.org/source/mediawiki/browse/master/languages/data/Names.php)). Also if the wiki language is not English `Category` could be localized. – Tgr Nov 14 '17 at 08:57
  • @Tgr I am aware of that, but in our case interwikis are only two-letter and the wiki language is English. – AbyxDev Nov 14 '17 at 09:14

For your example, the regex (==.+\s(?:[\*][\s].+\s)+) with \1{{template}}\n as the substituting expression will work just fine.

Demo: https://regex101.com/r/BPBmFL

But you may have other cases where it won't work.

Edit:

Try the regex ((.|\n)*)(\[\[.*\:.*\]\]\n) with \1{{template}}\n\n\3 as the substituting expression.

This way it matches everything up to the category/interwiki links, so you can insert the {{template}} after everything else and before the categories.

Demo: https://regex101.com/r/Bv14kt/4
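
A quick check of the edited regex in Python, assuming the sample page from the question (the `page` variable is just that sample typed out). Because the first group is greedy, it seems to swallow every link except the last one, so the template ends up after `[[Category:Pages]]` rather than before it.

```python
import re

# Sample page from the question.
page = (
    '== See Also ==\n'
    '* [[Link one]]\n'
    '* [[more link]]\n'
    '* [//external.link external link]\n'
    '\n'
    '[[Category:Pages]]\n'
    '[[de:Spezial Page]]\n'
)

pattern = r'((.|\n)*)(\[\[.*\:.*\]\]\n)'
result = re.sub(pattern, r'\1{{template}}\n\n\3', page)

# Group 1 is greedy, so group 3 only captures the last link line.
print(result)
```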

mmm
  • Yeah, there's also cases like `==References==\n`, and not specifically See Also. Basically it needs to be inserted immediately after the last section, regardless of the content of the section, but _before_ the category/interwiki links. – AbyxDev Nov 12 '17 at 04:08
  • That worked in your demo, but it inserts the template before the last _two_ category links - see [this](https://regex101.com/r/Bv14kt/5). – AbyxDev Nov 12 '17 at 11:20

Actually regexes are powerful enough for this specific task, although in general it is indeed a bad idea to use them for parsing wikitext. Something like

(\[\[(Category|\w{2,3}(-\w+){0,2}):[^\[\]<>]+\]\]\s*)*$

would work.
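
One way this pattern could be applied in Python is sketched below. The helper name `insert_template` and the choice to splice at the start of the match are assumptions for illustration, not part of the answer. Since every part of the pattern is optional, `search()` always finds at least an empty match at the end, so the sketch treats an empty match as "no trailing links" and appends the template instead.

```python
import re

# The pattern above: a run of category/interwiki links anchored at the end.
TRAILING_LINKS = re.compile(
    r'(\[\[(Category|\w{2,3}(-\w+){0,2}):[^\[\]<>]+\]\]\s*)*$'
)

def insert_template(text, template='{{template}}'):
    """Hypothetical helper: put template before the trailing link block."""
    match = TRAILING_LINKS.search(text)  # always succeeds (may be empty)
    if match.group(0) == '':
        # No trailing links: append the template on its own line.
        return text.rstrip('\n') + '\n' + template + '\n'
    start = match.start()
    return text[:start] + template + '\n' + text[start:]

page = (
    '== See Also ==\n'
    '* [[Link one]]\n'
    '* [[more link]]\n'
    '* [//external.link external link]\n'
    '\n'
    '[[Category:Pages]]\n'
    '[[de:Spezial Page]]\n'
)
result = insert_template(page)
print(result)
```

Splicing at `match.start()` rather than calling `re.sub` avoids a second replacement: on newer Python 3 versions `re.sub` also replaces the empty match left at the end of the string unless `count=1` is passed.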

Tgr
  • Hmm, very nice. However, I did figure out a solution using mwparserfromhell (which you have recommended to me in the past) (see my answer). I'll keep this in mind though, thank you! – AbyxDev Nov 14 '17 at 08:50