Regular expression in Python: removing brackets with brackets inside

Question

I have a Wiktionary dump and I am struggling to find an appropriate regex pattern to remove the double brackets in an expression. Here is an example of such an expression:

line = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."

I want to remove the whole bracketed part, including any brackets nested inside it, whenever it begins with {{source|:

Expected result: # Test is a cool word.

I tried using re.sub like this:

line = re.sub("{{source\|.*?}}", "", line )

but I got:

# Test is a cool word, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}.

A line can also have no nested brackets, like this:

line = "# Test is a cool word {{source|Nicolas|Daniel, Presses de l'Université de Montréal 4}}"

Thank you for your help!

Do you mean you only want to do that with `re` library? If you need a regex, best is to use the PyPi regex library. — Wiktor Stribiżew, Dec 21 '20 at 17:48
Can you have text you want to keep after the last `}}` or can you have multiple matches per sentence? — Wiktor Stribiżew, Dec 21 '20 at 17:58

score 1 · Accepted Answer · answered Dec 21 '20 at 18:04

You can install the PyPI regex library (type pip install regex in the terminal/console and press ENTER), and then use:

import regex
rx = r"\s*{{source\|(?>[^{}]|({{(?:[^{}]++|(?1))*}}))*}}\s*"
line = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."
# Pass the pattern as the first argument, the replacement as the second
print( regex.sub(rx, '', line) )
# => # Test is a cool word.

See the Python demo. The regex is

\s*\{\{source\|(?>[^{}]|(\{\{(?:[^{}]++|(?1))*}}))*}}\s*

See the regex demo. Details:

\s* - zero or more whitespaces
{{source\| - a literal {{source| string
(?>[^{}]|({{(?:[^{}]++|(?1))*}}))* - zero or more repetitions of:
- [^{}] - a char other than { and }
- | - or
- ({{(?:[^{}]++|(?1))*}}) - Group 1 (it is necessary for recursion): {{, then zero or more occurrences of either one or more chars other than { and } or the Group 1 pattern recursed, and then a }} string
}} - a }} string
\s* - zero or more whitespaces.

score 0 · Answer 2 · answered Dec 21 '20 at 18:07

The .*?}} subexpression will find the shortest possible string which ends with }}. If you want to skip pairs of {{...}} you have to say so.

re.sub(r"\{\{source\|(?:\{\{.*?\}\})*.*?\}\}", "", line)

Note also that if you want to extend this to also handle additional levels of nesting, you have to spell that out explicitly, too; regular expressions are really not an adequate tool for handling nested structures, especially not arbitrarily nested structures. (This is a common FAQ.)
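To illustrate what "spelling it out" can look like, here is a minimal sketch that uses only the standard re module. It assumes at most one level of {{...}} nested inside the source template, wherever it appears in the template. The name SOURCE_ONE_LEVEL and the leading \s* (to swallow the space before the template) are illustrative choices, not part of the answer above.

```python
import re

# Assumption: at most ONE level of {{...}} nested inside {{source|...}}.
SOURCE_ONE_LEVEL = re.compile(
    r"\s*"               # optional whitespace before the template
    r"\{\{source\|"      # literal opener
    r"(?:[^{}]"          # either any single non-brace character
    r"|\{\{[^{}]*\}\}"   # or one complete inner {{...}} with no braces inside
    r")*"
    r"\}\}"              # closing braces of the source template
)

line1 = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."
line2 = "# Test is a cool word {{source|Nicolas|Daniel, Presses de l'Université de Montréal 4}}"

print(SOURCE_ONE_LEVEL.sub("", line1))  # => # Test is a cool word.
print(SOURCE_ONE_LEVEL.sub("", line2))  # => # Test is a cool word
```

Each extra level of nesting would need another copy of the inner alternative, which is why this approach does not scale to arbitrary depth.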

score 0 · Answer 3 · answered Dec 21 '20 at 20:51

You could do it without a regular expression and cover all levels of embedded metadata.

line = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."

from itertools import accumulate
levels = accumulate( (c=="{")-(p=="}") for c,p in zip(line," "+line) )
result = "".join(c for c,level in zip(line,levels) if level==0)

print(result)
# Test is a cool word .

This computes an incremental "level" of embedding that goes up with each "{" and back down after each "}". Characters that are at level zero are part of the actual text and everything else is excluded.
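Note that the snippet above drops every {{...}} block, not only those starting with {{source|, and it keeps the space in front of the removed block. If only the source templates should go, the same depth-counting idea can be restricted to them. The following is a rough sketch under that assumption; strip_source_templates and its opener parameter are hypothetical names introduced here for illustration.

```python
def strip_source_templates(text, opener="{{source|"):
    """Remove each {{source|...}} block, honoring nested {{...}} pairs."""
    out = []
    i = 0
    while i < len(text):
        start = text.find(opener, i)
        if start == -1:
            out.append(text[i:])          # no more templates: keep the rest
            break
        # Walk forward from the opener, counting {{ and }} pairs.
        depth = 0
        j = start
        while j < len(text):
            if text.startswith("{{", j):
                depth += 1
                j += 2
            elif text.startswith("}}", j):
                depth -= 1
                j += 2
                if depth == 0:            # matching close of the opener found
                    break
            else:
                j += 1
        if depth != 0:
            out.append(text[i:])          # unbalanced braces: leave text as is
            break
        # Keep the text before the template, minus trailing whitespace.
        out.append(text[i:start].rstrip())
        i = j
    return "".join(out)


line = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."
print(strip_source_templates(line))
# => # Test is a cool word.
```

Other templates such as a hypothetical {{w|...}} are left in place, because the scan only starts at the opener string.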

score -1 · Answer 4 · answered Dec 21 '20 at 17:56


You can use the following regex to obtain the desired result.

\W+{{source\|.*}}

The code will be:

re.sub(r"\W+{{source\|.*}}", "", line)

The regular expression is almost the same as the one in the question, except for the ?. Removing the ? makes .* greedy, so it matches as many characters as possible, up to the last }} on the line.

Additionally, \W+ is added to remove the space before the {{...}}.

Answered by rakinhaider.

If the text has multiple `{{source|...}}` occurrences, this will eat all the text between the first one and the last one. – tripleee Dec 21 '20 at 18:09
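A small sketch of the problem described in this comment, using a made-up line with two source templates (the sample text is hypothetical, and the braces are escaped here only for readability):

```python
import re

# Hypothetical line with two separate source templates.
line = "# first {{source|A}} middle text {{source|B}} end"

# Greedy .* runs to the LAST }} on the line, so "middle text" is lost too.
greedy = re.sub(r"\W+\{\{source\|.*\}\}", "", line)
print(greedy)  # => # first end

# Lazy .*? stops at the first }}, which is fine here but breaks on nesting.
lazy = re.sub(r"\W+\{\{source\|.*?\}\}", "", line)
print(lazy)    # => # first middle text end
```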
