
I'm trying to search and replace citations from pandoc-markdown. They have the following syntax:

[prenote @autorkey, postnote]

Or, for more than one author:

[prenote1 @authorekey1, postnote1; prenote2 @authorkey2, postnote2]

The pre-notes, the author-keys and the post-notes should each be in their own capture group.

For a citation with only one author, I used this regex:

\[((.*) )?@(.*?)(, (.*))?\]

But I can't figure out how to match a citation with multiple authors. Ideally it would be possible to match citations with one or more author keys. The pre-note and the post-note should be optional.

Is this possible?

Bonschi
    _"The pre-notes, the author-keys and the post-notes should each be in their own capture group"_. What you're trying to do is capture a dynamic number of groups, i.e. repeat a capturing group. It won't work this way (source: an [SO answer](https://stackoverflow.com/a/3537914/4375327) linking to a [detailed article](https://www.regular-expressions.info/captureall.html)). – Amessihel Oct 15 '20 at 15:56
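To illustrate the comment's point, here is a minimal sketch using Python's standard `re` module. The pattern is a hypothetical simplification (it assumes notes and keys are single words) and is not the asker's regex; it only shows that a repeated capturing group keeps the last repetition.

```python
import re

# Hypothetical pattern for illustration: the non-capturing group repeats once per author.
repeated = re.compile(r"\[(?:\s*(\w+) @(\w+), (\w+);?)+\]")

citation = "[prenote1 @authorekey1, postnote1; prenote2 @authorkey2, postnote2]"
match = repeated.fullmatch(citation)

# Each group keeps only its final repetition, so the first author's data is lost.
last_groups = match.groups()
print(last_groups)
```

This prints `('prenote2', 'authorkey2', 'postnote2')`: the whole citation matches, but only the second author's parts remain in the groups.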

2 Answers


We need more context with code (full sample code) to be able to answer fully, so I can only answer in the same general way in which you asked the question.

I do not believe you can do it in one operation with one regular expression.

So the overall technique I would use is:

  1. First match the entire citation (with one or more authors) using a simple regex with only one group, namely for everything between [ and ].
  2. Then, when a match is found, split what is in that match (i.e. everything between the square brackets) by ; to get a list of "prenote @authorkey, postnote" strings.
  3. Do the wanted replacements on each element in that resulting list of single author strings.
  4. Stitch together the final citation by joining the resulting list with semicolons again and adding [ and ] around it.
  5. Put that final citation in the original instead of the matched string.

You can put steps 2 to 4 in a function f(match_object), and then use re.sub(pattern, f, string) to do the replacement. It will call function f for each match it finds, and replace that match with the return value of f.
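The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `CITATION`, `ITEM`, `rewrite_item`, `rewrite_citation` and `rewrite` are hypothetical, and since the question does not say what the replacement should be, upper-casing the author key stands in for the real one.

```python
import re

# Step 1: one group for everything between [ and ], requiring at least one @key.
CITATION = re.compile(r"\[([^\[\]\n]*@[^\[\]\n]*)\]")

# One "prenote @key, postnote" item; pre-note and post-note are both optional.
ITEM = re.compile(r"^(?:(?P<pre>.*?)\s+)?@(?P<key>[^\s,;\]]+)(?:,\s*(?P<post>.*))?$")


def rewrite_item(item):
    """Step 3: rewrite a single-author item (placeholder: upper-case the key)."""
    m = ITEM.match(item.strip())
    if m is None:
        return item.strip()  # leave anything unexpected untouched
    pre, key, post = m.group("pre", "key", "post")
    key = key.upper()  # hypothetical replacement, swap in the real one here
    result = (pre + " " if pre else "") + "@" + key
    if post:
        result += ", " + post
    return result


def rewrite_citation(match):
    """Steps 2 to 4: split on ';', rewrite each item, join and re-add brackets."""
    items = match.group(1).split(";")
    return "[" + "; ".join(rewrite_item(item) for item in items) + "]"


def rewrite(text):
    """Step 5: re.sub calls rewrite_citation once per citation found."""
    return CITATION.sub(rewrite_citation, text)


print(rewrite("See [prenote1 @authorekey1, postnote1; prenote2 @authorkey2, postnote2] and [@solo]."))
```

Splitting on `;` assumes that pre-notes and post-notes never contain a semicolon themselves; if they can, the split step would need a stricter rule.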

Jesper

You can use the PyPI regex module, which supports the \G anchor (the built-in re module does not), to get the 3 capturing groups.

(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) @(.*?), ([^][,\r\n]*)[\];]


Explanation

  • (?: Non capture group
    • \G(?!^) Assert the position at the end of the previous match, not at the start
    • | Or
    • \[(?=[^][\r\n]*\]) Match [ and assert that there is a closing ]
  • ) Close non capture group
  • [^\S\r\n]* Match 0+ occurrences of a whitespace char except a newline
  • (.*?) Capture group 1, match any char except a newline, as few as possible
  •  @ Match a space followed by @ literally
  • (.*?) Capture group 2, match any char except a newline, as few as possible
  • ,  Match a comma followed by a space literally
  • ([^][,\r\n]*) Capture group 3, match any char except ] [ , or a newline
  • [\];] Match either ] or ;

Example code using regex.finditer

import regex

pattern = r"(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) @(.*?), ([^][,\r\n]*)[\];]"

test_str = ("[prenote @autorkey, postnote]\n"
            "[prenote1 @authorekey1, postnote1; prenote2 @authorkey2, postnote2]\n")

# Each match is one "prenote @key, postnote" item; print its three groups in order.
for match in regex.finditer(pattern, test_str):
    for group in match.groups():
        print(group)

Output

prenote
autorkey
postnote
prenote1
authorekey1
postnote1
prenote2
authorkey2
postnote2
The fourth bird