Regular expression for pandoc-markdown citations

Question

I'm trying to search and replace citations from pandoc-markdown. They have the following syntax:

[prenote @autorkey, postnote]

Or, for more than one author:

[prenote1 @authorekey1, postnote1; prenote2 @authorkey2, postnote2]

The pre-notes, the author-keys and the post-notes should each be in their own capture group.

For a citation with only one author I used this regex:

\[((.*) )?@(.*?)(, (.*))?\]

But I can't figure out how to match a citation with multiple authors. Ideally it would be possible to match citations with one or more author keys. The pre-note and the post-note should be optional.

Is this possible?

_"The pre-notes, the author-keys and the post-notes should each be in their own capture group"_. What you're describing is a dynamic number of capture groups, i.e. a repeated capturing group. That won't work this way, because in most regex engines a repeated group keeps only its last iteration (source: a [SO answer](https://stackoverflow.com/a/3537914/4375327) linking to a [detailed article](https://www.regular-expressions.info/captureall.html)). - Amessihel, Oct 15 '20 at 15:56
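To illustrate the comment above, here is a minimal sketch using Python's standard `re` module (the language is an assumption; the question does not name one). The pattern is a simplified, hypothetical one written only for this demonstration: it repeats a single capture group once per author key and shows that only the last key is retained.

```python
import re

# Simplified demo pattern (not from the question): the inner (...) group
# sits inside a repeated (?: ... )+ and therefore runs once per author.
pattern = re.compile(r"\[(?:[^@\];]*@([^,;\]]+)[^;\]]*;?)+\]")

m = pattern.search(
    "[prenote1 @authorekey1, postnote1; prenote2 @authorkey2, postnote2]"
)

# There is still exactly one group, and it holds only the last iteration.
print(len(m.groups()))
print(m.group(1))
```

The first key, `authorekey1`, is matched but overwritten, which is why a single pattern with a fixed set of groups cannot return every pre-note, key and post-note at once.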

score 1 · Answer 1 · answered Oct 15 '20 at 16:27

We need more context with code (full sample code) to be able to answer fully, so I can only answer in the same general way in which you asked the question.

I do not believe you can do it in one operation with one regular expression.

So the overall technique I would use is:

1. First match the entire citation (with one or more authors) using a simple regex with only one group, namely for everything between [ and ].
2. When a match is found, split the content of that group (i.e. everything between the square brackets) on ; to get a list of "prenote @authorkey, postnote" strings.
3. Do the wanted replacements on each element of that list of single-author strings.
4. Stitch the final citation together by joining the list with semicolons again and wrapping it in [ and ].
5. Put that final citation into the original text in place of the matched string.

You can put steps 2 to 4 in a function f(match_object), and then use re.sub(pattern, f, string) to do the replacement. It will call function f for each match it finds, and replace that match with the return value of f.
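The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the question does not say what the replacement should be, so `rewrite_item` is a hypothetical placeholder that upper-cases the author key, and the names `CITATION`, `ITEM`, `replace_citation` and `replace_citations` are made up for this example. Items that do not fit the simple "prenote @key, postnote" shape are left untouched.

```python
import re

# Step 1: a whole citation, i.e. everything between [ and ] containing an @.
CITATION = re.compile(r"\[([^\[\]\r\n]*@[^\[\]\r\n]*)\]")

# One "prenote @authorkey, postnote" item; pre-note and post-note are optional.
ITEM = re.compile(r"^(?:(?P<pre>.*?)\s+)?@(?P<key>[^,\s]+)(?:,\s*(?P<post>.*))?$")


def rewrite_item(pre, key, post):
    """Hypothetical replacement: upper-case the key, keep both notes."""
    parts = []
    if pre:
        parts.append(pre + " ")
    parts.append("@" + key.upper())
    if post:
        parts.append(", " + post)
    return "".join(parts)


def replace_citation(match):
    """Steps 2 to 4: split on ';', rewrite each item, join and re-wrap."""
    items = []
    for raw in match.group(1).split(";"):
        m = ITEM.match(raw.strip())
        if m is None:
            # Leave citations we cannot parse exactly as they were.
            return match.group(0)
        items.append(rewrite_item(m["pre"], m["key"], m["post"]))
    return "[" + "; ".join(items) + "]"


def replace_citations(text):
    # Step 5: re.sub calls replace_citation once per matched citation.
    return CITATION.sub(replace_citation, text)


print(replace_citations("See [prenote @autorkey, postnote] and [a @k1; @k2, p. 3]."))
```

Splitting on `;` first keeps each regex small, and swapping in a different `rewrite_item` changes the replacement without touching the matching logic.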

The fourth bird · Answer 2 · 2020-10-16T20:42:49.013

You might make use of the PyPI regex module to get the 3 capturing groups.

(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) @(.*?), ([^][,\r\n]*)[\];]


Explanation

(?: Non capture group
- \G(?!^) Assert the position at the end of the previous match, not at the start
- | Or
- \[(?=[^][\r\n]*\]) Match [ and assert that there is a closing ]
) Close non capture group
[^\S\r\n]* Match 0+ occurrences of a whitespace char except a newline
(.*?) Capture group 1, match any char except a newline, as few as possible
" @" Match a space followed by @ literally
(.*?) Capture group 2, match any char except a newline, as few as possible
", " Match a comma followed by a space literally
([^][,\r\n]*) Capture group 3, match any char except ] [ , or a newline
[\];] Match either ] or ;

Example code using regex.finditer

import regex

pattern = r"(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) @(.*?), ([^][,\r\n]*)[\];]"

test_str = ("[prenote @autorkey, postnote]\n"
            "[prenote1 @authorekey1, postnote1; prenote2 @authorkey2, postnote2]\n")

# Each match is one "prenote @authorkey, postnote" item; print its three groups.
for match in regex.finditer(pattern, test_str):
    for group in match.groups():
        print(group)

Output

prenote
autorkey
postnote
prenote1
authorekey1
postnote1
prenote2
authorkey2
postnote2
