Regexp assistance needed parsing mediawiki template with Javascript

Question

I'm handling MediaWiki markup with JavaScript and trying to remove certain template parameters. I'm having trouble matching exactly the text, and only the text, that I want to remove.

Simplified down, the template text can look something like this:

{{TemplateX
| a =
Foo bar
Blah blah

Fizbin foo[[domain:blah]]

Ipsum lorem[[domain:blah]]
|b =1
|c = 0fillertext
|d = 1alphabet
| e =
| f = 10: One Hobbit
| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000
| j = Level 4 [[domain:filk|Songs]]
| k =7 fizbin, 8 [[domain:trekkies|Shatners]]
|l = 
|m = 
}}

The best I've come up with so far is

~~/\|\s?(a|b|d|f|j|k|m)([^][^\n\|])+/gm~~

Updated version:

/\|\s?(a|b|d|f|j|k|m)(?:[^\n\|]|[.\n])+/gm

which gives (with the updated regexp):

{{TemplateX


|c = 0fillertext

| e =

| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000

|Songs]]

|Shatners]]
|l =

But what I'm trying to get is:

{{TemplateX
|c = 0fillertext
| e =
| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000
|l = 
}}

I can deal with the extraneous newlines, but I still need to make sure that '|Songs]]' and '|Shatners]]' are also matched by the regexp.

Regarding Tgr's comment below,

For my purposes, it is safe to assume that every parameter starts on a new line, where | is the first character on the line, and that no parameter definition includes a | that isn't within a [[foo|bar]] construct. So '\n|' is a safe "start" and "stop" sequence. The question therefore boils down to this: for any given set of params (a, b, d, f, j, k, and m in the question), I need a regex that matches the whole of 'wanted param' in the following:

| [other param 1] = ... 
| [wanted param] = possibly multiple lines and |s that aren't after a newline
| [other param 2]
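
For comparison, the same boundary rule can be applied without a regex for the value at all. The following is only a sketch under the assumptions stated above (every parameter starts a line with |, and the template closes with }} at the start of a line); removeParams is a hypothetical helper name, and parameters without an = sign are not handled.

```javascript
// Hypothetical helper: drops the named parameters from a template whose
// parameters each start on a new line with "|" as the first character.
function removeParams(wikitext, names) {
  const unwanted = new Set(names);
  const kept = [];
  let skipping = false;
  for (const line of wikitext.split(/\r?\n/)) {
    // A line such as "| a =" or "|b =1" opens a new parameter.
    const start = /^\|\s*([^=|\s]+)\s*=/.exec(line);
    if (start) {
      // Decide whether this parameter and its continuation lines are dropped.
      skipping = unwanted.has(start[1]);
    } else if (/^\}\}/.test(line)) {
      // The closing braces end the current parameter and are always kept.
      skipping = false;
    }
    if (!skipping) kept.push(line);
  }
  return kept.join("\n");
}
```

Because a | in the middle of a line never starts a parameter here, link text such as |Songs]] stays attached to the parameter it belongs to and is removed along with it.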

score 2 · Accepted Answer · edited May 23 '17 at 12:34

You can try the regex below. It matches the parameters you want to keep, not those you want to remove:

(^{{TemplateX)|\|\s*(c|e|g|h|i|l[ ]*\=[ ]*)(.*)|(}}$)

Tested here.

Edit

I enhanced it to this which I think is a bit better if you compare the two regexes using the diagram tool at regexper.com:

(^{{TemplateX)|(\|[ ]*)(c|e|g|h|i|l)([ ]*\=[ ]*)(.*)|(}}$)

Edit 2

Further to the comments, the regex to match the unwanted parameters is this:

\|[ ]?(a|b|d|f|j|k|m)([ ]*\=[ ]*)((?![\r\n]+\|)[0-9a-zA-Z, \[\]:\|\r\n\t])+

Leveraging this answer, it uses a negative lookahead to match only up to [\r\n]+\|, which in part satisfies the statement that:

So '\n|' is a safe "start" and "stop" sequence

Tested here with the introduction of a few newlines in the parameters to be retained (e.g. g).

There is a risk that you may have a parameter value with a character other than

[0-9a-zA-Z, \[\]:\|\r\n\t]

To solve that you would need to update that list.
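
One way to sidestep the character list entirely, offered as a sketch and not as the regex from this answer: keep the negative lookahead but let the value be any character at all ([\s\S]), and stop at a line break followed by either | or the closing }}. The helper name stripParams is hypothetical, and the pattern still relies on the asker's assumption that every parameter starts at the beginning of a line.

```javascript
// Hypothetical helper, not the answer's original regex.
function stripParams(wikitext, names) {
  // Escape regex metacharacters so parameter names are matched literally.
  const alternation = names
    .map((n) => n.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");
  const unwanted = new RegExp(
    // "|" at the start of a line, optional space, an unwanted name, then "=".
    "^\\|[ ]?(?:" + alternation + ")[ ]*=" +
      // Consume any character until a line break followed by "|" or "}}".
      "(?:(?![\\r\\n]+(?:\\||\\}\\}))[\\s\\S])*" +
      // Also swallow the line break(s) that ended the value.
      "[\\r\\n]*",
    "gm"
  );
  return wikitext.replace(unwanted, "");
}
```

Since the value part no longer depends on a whitelist, text in other scripts (Japanese, Korean, and so on) is matched the same way as ASCII, and swallowing the trailing line breaks avoids the leftover blank lines.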

answered Apr 08 '17 at 06:26 by Robin Mackenzie · edited May 23 '17 at 12:34 by Community

This is doing the opposite of what I'm trying to do. As stated, the initial template is simplified down; with your version, all parameters you want to keep, rather than those you want to remove, would have to be explicitly noted. It also breaks on any parameters with multiple lines (ie, if c had a similar value to a.) – BrianFreud Apr 08 '17 at 13:46
@BrianFreud - perhaps I totally misunderstood the question. I rather thought that if I matched on the remaining parameters this would work as it matched your required output. I updated my answer to attempt to provide a solution for matching on the parameters you want to exclude including newlines i.e. multi-line parameters beginning a, b, d etc are matched and multi-line parameters not in the set a, b, d etc are not matched. – Robin Mackenzie Apr 08 '17 at 14:30
Many thanks; the edit 2 version seems to work well, though I've not yet finished understanding quite how it's working. :D That risk shouldn't be too rough - the parameter names are at least controllable. Good call as well with regards to \r; I don't run into them from a Linux box, but I have had those cause weirdness when copied to textareas from Windows boxes in the past. :) – BrianFreud Apr 08 '17 at 14:55
Question; was there a specific reason you went with [ ] over \s? – BrianFreud Apr 08 '17 at 14:56
Turns out that that risk *is* a problem. The parameter name is controllable, but the parameter *value*, in my case, may contain English, German, French, Japanese, Korean, Chinese, or a few others. A limit to 0-9a-zA-Z won't work... – BrianFreud Apr 08 '17 at 15:02
Solved the problem; added to the end of the question. Thanks again! – BrianFreud Apr 08 '17 at 15:33
Hi Brian - glad it helped - my reason to use `[ ]` instead of `\s` is that at various points in the regex you specifically want to call out matching a space, or two, and at other points want to call out matching a newline. I therefore thought it was a good idea to be as specific as possible about this, given that \s will match `[\r\n\t\v\ ]`. – Robin Mackenzie Apr 08 '17 at 15:39
Gotcha. I'm not so worried about mismatches - catching the odd tab instead of a space is probably a good thing. But good reasoning. :) – BrianFreud Apr 09 '17 at 02:03

score 0 · Answer 2 · answered Apr 08 '17 at 13:19

Trying to account for the full flexibility of template language is hopeless. For example, a template could look like

{{TemplateX
| a=1 | b=2 }}

or

{{TemplateX|
| a=1 <nowiki>|</nowiki> b=2 }}

which is completely different (the first one has two parameters, a and b, the second one a single a parameter). Regular expressions match (mostly) regular languages and cannot track nested or context-dependent constructs like that.
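
As a rough illustration of the problem, a context-blind approach that treats every | as a separator cannot tell a literal pipe from a real one. The function naiveParamCount below is a hypothetical name used only for this demonstration.

```javascript
// Hypothetical naive counter: treats every "|" as a parameter separator,
// with no awareness of <nowiki>, links, or nested templates.
function naiveParamCount(template) {
  // Strip the outer "{{" and "}}" before splitting.
  const body = template.replace(/^\{\{|\}\}$/g, "");
  return body.split("|").length - 1;
}

const first = "{{TemplateX\n| a=1 | b=2 }}";
const second = "{{TemplateX|\n| a=1 <nowiki>|</nowiki> b=2 }}";
console.log(naiveParamCount(first), naiveParamCount(second));
```

The counter sees three separators in the second example, one of which sits inside <nowiki> and is plain text as far as the template is concerned.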

So unless you are sure the template is always used according to the same convention, you are better off using some proper parser such as mwparserfromhell:

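import mwparserfromhell

# `text` holds the raw wikitext to clean up
wikicode = mwparserfromhell.parse(text)
for template in wikicode.filter_templates(recursive=True, matches=lambda t: t.name.strip() == 'TemplateX'):
    for param in ['a', 'b', 'd', 'f', 'j', 'k', 'm']:
        # remove() raises ValueError if the parameter is absent, so check first
        if template.has(param):
            template.remove(param)
print(wikicode)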

(This would require rewriting your code in Python or calling out to a Python backend service. I don't think there is any good wikitext parser in Javascript.)

Alternatively, you can use the parse API with prop=parsetree to get an XML tree representation of the template and its arguments, which is not that hard to process.
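
A minimal sketch of building such a request from JavaScript, assuming the wiki exposes the standard api.php endpoint. The helper name buildParseTreeUrl and the example.org URL are placeholders, and long wikitext would be better sent as a POST body than in a query string.

```javascript
// Hypothetical helper: builds a parse API request URL asking for the XML parse tree.
// `apiUrl` is assumed to point at the wiki's api.php.
function buildParseTreeUrl(apiUrl, wikitext) {
  const params = new URLSearchParams({
    action: "parse",
    prop: "parsetree",
    contentmodel: "wikitext", // parsetree requires the wikitext content model
    text: wikitext,
    format: "json",
  });
  return apiUrl + "?" + params.toString();
}

// Placeholder endpoint, for illustration only.
const url = buildParseTreeUrl("https://example.org/w/api.php", "{{TemplateX|a=1}}");
console.log(url);
```

The resulting URL can then be fetched and the returned XML walked to find each template part by name.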

(Reply moved to question) – BrianFreud Apr 08 '17 at 13:52
