1

I have a Wiktionary dump and am struggling to find an appropriate regex pattern to remove the double-brace expressions it contains. Here is an example of such an expression:

line = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."

I want to remove the whole bracketed expression, including anything nested inside it, when it begins with {{source|:

Expected result: # Test is a cool word.

I tried using re.sub like this: line = re.sub("{{source\|.*?}}", "", line)

but I got: # Test is a cool word, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}.

I could also have another sentence like this: line = "# Test is a cool word {{source|Nicolas|Daniel, Presses de l'Université de Montréal 4}}"

Thank you for your help!

kavaliero

4 Answers

1

You can install the PyPI regex library (run pip install regex in a terminal), and then use

import regex
rx = r"\s*{{source\|(?>[^{}]|({{(?:[^{}]++|(?1))*}}))*}}\s*"
line = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."
print(regex.sub(rx, '', line))
# => # Test is a cool word.

See the Python demo. The regex is

\s*\{\{source\|(?>[^{}]|(\{\{(?:[^{}]++|(?1))*}}))*}}\s*

See the regex demo. Details:

  • \s* - zero or more whitespaces
  • {{source\| - a literal {{source| string
  • (?>[^{}]|({{(?:[^{}]++|(?1))*}}))* - zero or more repetitions of:
    • [^{}] - a char other than { and }
    • | - or
    • ({{(?:[^{}]++|(?1))*}}) - Group 1 (it is necessary for recursion): {{, then zero or more occurrences of either one or more chars other than { and } or the Group 1 pattern recursed, and then a }} string
  • }} - a }} string
  • \s* - zero or more whitespaces.
Wiktor Stribiżew
0

The .*?}} subexpression will find the shortest possible string which ends with }}. If you want to skip pairs of {{...}} you have to say so.

re.sub(r"\{\{source\|(?:\{\{.*?\}\})*.*?\}\}", "", line)

Note also that if you want to extend this to also handle additional levels of nesting, you have to spell that out explicitly, too; regular expressions are really not an adequate tool for handling nested structures, especially not arbitrarily nested structures. (This is a common FAQ.)
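
As a rough illustration of what "spelling out" one level of nesting can look like with the standard re module, here is a sketch. It is a variant of the pattern above, not the same pattern: it allows a nested {{...}} anywhere inside the {{source|...}} body, not only right after the pipe. The names ONE_LEVEL and strip_source_one_level are made up for this example.

```python
import re

# Body of {{source|...}} is any mix of:
#   - a single character that is not a brace, or
#   - one nested {{...}} that itself contains no braces.
# Deeper nesting is deliberately NOT handled by this pattern.
ONE_LEVEL = re.compile(r"\{\{source\|(?:[^{}]|\{\{[^{}]*\}\})*\}\}")

def strip_source_one_level(text):
    """Remove {{source|...}} templates with at most one nested {{...}} level."""
    return ONE_LEVEL.sub("", text)

line1 = ("# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, "
         "''La langue et le nombril'', Presses de l'Université de Montréal "
         "(PUM), 2020, p. 174}}.")
line2 = ("# Test is a cool word {{source|Nicolas|Daniel, "
         "Presses de l'Université de Montréal 4}}")

print(strip_source_one_level(line1))  # => # Test is a cool word .
print(strip_source_one_level(line2))  # => # Test is a cool word (with a trailing space)

# Two levels of nesting do not match, so the text is left untouched.
print(strip_source_one_level("{{source|{{a|{{b}}}}}}"))  # => {{source|{{a|{{b}}}}}}
```

The space before the template is kept; prepend \s* to the pattern if it should go too. Each extra nesting level would need another copy of the inner alternative, which is why this approach does not scale.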

tripleee
0

You could do it without a regular expression and cover all levels of embedded meta data.

line = "# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, ''La langue et le nombril'', Presses de l'Université de Montréal (PUM), 2020, p. 174}}."

from itertools import accumulate
levels = accumulate( (c=="{")-(p=="}") for c,p in zip(line," "+line) )
result = "".join(c for c,level in zip(line,levels) if level==0)

print(result)
# Test is a cool word .

This computes an incremental "level" of embedding that goes up with each "{" and back down after each "}". Characters that are at level zero are part of the actual text and everything else is excluded.
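
Note that the snippet above drops every {{...}} block, not only those starting with {{source|. If other templates must survive, the same level-counting idea can be limited to {{source| openers. The following is only a sketch under that assumption; the function name strip_source_templates and its opener parameter are invented for illustration.

```python
def strip_source_templates(text, opener="{{source|"):
    """Remove each {{source|...}} template, however deeply {{...}} is nested in it."""
    out = []
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            out.append(text[pos:])      # no more templates: keep the tail
            break
        out.append(text[pos:start])     # keep the text before the template
        depth = 1                       # we are inside the opening {{
        i = start + len(opener)
        while i < len(text) and depth > 0:
            if text.startswith("{{", i):
                depth += 1
                i += 2
            elif text.startswith("}}", i):
                depth -= 1
                i += 2
            else:
                i += 1
        if depth > 0:
            # Unbalanced template: leave the rest of the text unchanged.
            out.append(text[start:])
            break
        pos = i                         # continue after the closing }}
    return "".join(out)

line = ("# Test is a cool word {{source|{{nom w pc|Chantal|Bouchard}}, "
        "''La langue et le nombril'', Presses de l'Université de Montréal "
        "(PUM), 2020, p. 174}}.")
print(strip_source_templates(line))
# => # Test is a cool word .
```

As with the accumulate version, the space before the template is kept, so a final cleanup step may still be needed.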

Alain T.
-1

You can use the following regex to obtain the desired result.

\W+{{source\|.*}}

The code will be:

re.sub(r"\W+{{source\|.*}}", "", line)

The regular expression is almost the same as the one in the question, except for the ?. Removing the ? makes .* greedy, so it matches as many characters as possible.

Additionally, \W+ is added to remove the space before the {{...}}.

rakinhaider
    If the text has multiple `{{source|...}}` occurrences, this will eat all the text between the first one and the last one. – tripleee Dec 21 '20 at 18:09
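
To make the point in the comment concrete, here is a small sketch with an invented sample string containing two {{source|...}} occurrences. The second pattern, which forbids braces inside the body, is shown only for contrast and handles flat (non-nested) templates only.

```python
import re

# Hypothetical input with two separate source templates.
text = "# first {{source|A}} and second {{source|B}}."

# Greedy .* runs to the LAST }} on the line, so the text in between is lost.
greedy = re.sub(r"\W+\{\{source\|.*\}\}", "", text)
print(greedy)  # => # first.

# Excluding braces from the body keeps each match within one template.
per_template = re.sub(r"\s*\{\{source\|[^{}]*\}\}", "", text)
print(per_template)  # => # first and second.
```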