How to use `re.findall` to extract data from a string

Question

I have the following string (file):

s = r'''
\newcommand{\commandName1}{This is first command}

\newcommand{\commandName2}{This is second command with {} brackets inside
in multiple lines {} {}
}

\newcommand{\commandName3}{This is third, last command}

'''

Now I would like to use Python's `re` package to extract the data into a dictionary where the keys are the command names (\commandName1, \commandName2 and \commandName3) and the values are "This is first command", "This is second command with {} brackets inside in multiple lines {} {}" and "This is third, last command". I tried something like:

re.findall(r'\\newcommand{(.+)}{(.+)}', s)

but it doesn't work because the second command has {} inside. What is the easiest way to do that?

You can use: `\\newcommand{([^}]+)}{(.+)}` – anubhava Oct 16 '22 at 11:02
It doesn't work for the above example (edited) – Jason Oct 16 '22 at 11:24

anubhava · Accepted Answer · 2022-10-16T17:00:51.797

3

You may use this regex:

(?s)\\newcommand{([^}]+)}{(.+?)}(?=\s*(?:\\newcommand|$))

RegEx Demo

Code Demo

RegEx Breakdown:

(?s): Enable DOTALL (single line) mode, so that . also matches line breaks
\\newcommand: Match the literal text \newcommand
{: Match a {
([^}]+): Match 1+ of any characters that are not } in capture group #1
}: Match a }
{: Match a {
(.+?): Match 1+ of any characters, as few as possible (lazy), in capture group #2
}: Match a }
(?=\s*(?:\\newcommand|$)): Lookahead to assert that what follows is 0 or more whitespace characters and then either \newcommand or the end of input.

Code:

import re

s = r'''
\newcommand{\commandName1}{This is first command}

\newcommand{\commandName2}{This is second command with {} brackets inside
in multiple lines {} {}
}

\newcommand{\commandName3}{This is third, last command}
'''

print(re.findall(r'(?s)\\newcommand{([^}]+)}{(.+?)}(?=\s*(?:\\newcommand|$))', s))
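
Since the question asks for a dictionary, the list of tuples returned by `re.findall` can be fed into a dict comprehension. The following is a minimal sketch and not part of the original answer: the names `pattern` and `commands` are illustrative, and collapsing whitespace in the values is an assumption based on the question writing the expected value of the second command on a single line.

```python
import re

s = r'''
\newcommand{\commandName1}{This is first command}

\newcommand{\commandName2}{This is second command with {} brackets inside
in multiple lines {} {}
}

\newcommand{\commandName3}{This is third, last command}
'''

# Same regex as in the answer above, compiled once for reuse.
pattern = re.compile(r'(?s)\\newcommand{([^}]+)}{(.+?)}(?=\s*(?:\\newcommand|$))')

# findall returns (name, body) tuples because the pattern has two groups.
# Assumption: runs of whitespace (including line breaks) in a body are
# collapsed to single spaces; drop the join/split to keep the raw text.
commands = {name: " ".join(body.split()) for name, body in pattern.findall(s)}

for name, body in commands.items():
    print(name, "->", body)
```

If the raw body text is needed (with its line breaks), `dict(pattern.findall(s))` is enough.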

edited Oct 16 '22 at 17:00

answered Oct 16 '22 at 11:43

anubhava


The following doesn't work (it finds only the first command): `re.findall('\\newcommand{([^}]+)}{(.+?)}(?=\s*\\newcommand|\Z)', s)` – Jason Oct 16 '22 at 14:47
And this finds only the first two commands: `re.findall('\\newcommand{([^}]+)}{(.+?)}(?=\s*\\newcommand|\Z)', s, re.DOTALL)` – Jason Oct 16 '22 at 14:50
@Jason: I have updated my answer to show suggested code and added an online code demo as well. – anubhava Oct 16 '22 at 17:03
The reason it didn't work for you earlier was the presence of a few whitespace characters/line breaks before the end of input, i.e. before `\Z` – anubhava Oct 16 '22 at 17:04
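
The trailing-whitespace point from the comments can be reproduced with the sample string. This is only a sketch: the names `strict` and `lenient` are made up here, and the patterns are written as raw strings, which is an assumption about what the commenter ran. In `strict`, `\Z` must match immediately after the closing `}`, but the sample ends with a line break, so the last command is never matched. In `lenient`, `\s*` applies to both alternatives, so whitespace before the end of input is allowed.

```python
import re

s = r'''
\newcommand{\commandName1}{This is first command}

\newcommand{\commandName2}{This is second command with {} brackets inside
in multiple lines {} {}
}

\newcommand{\commandName3}{This is third, last command}
'''

# \s* belongs only to the first alternative; \Z must follow } directly.
strict = r'\\newcommand{([^}]+)}{(.+?)}(?=\s*\\newcommand|\Z)'
# \s* is shared by both alternatives thanks to the non-capturing group.
lenient = r'\\newcommand{([^}]+)}{(.+?)}(?=\s*(?:\\newcommand|\Z))'

no_dotall = re.findall(strict, s)                # . cannot cross line breaks
with_dotall = re.findall(strict, s, re.DOTALL)   # last command still missed
fixed = re.findall(lenient, s, re.DOTALL)

print(len(no_dotall), len(with_dotall), len(fixed))
```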

score 2 · Answer 2 · answered Oct 16 '22 at 15:39


Try (regex101):

import re

s = r"""\newcommand{\commandName1}{This is first command}

\newcommand{\commandName2}{This is second command with {} brackets inside
in multiple lines {} {}
}

\newcommand{\commandName3}{This is third, last command}

"""

out = re.findall(
    r"^\\newcommand\{(.*?)\}\{((?:(?!^\\newcommand).)+)\}", s, flags=re.S | re.M
)
print(out)

Prints:

[
    ("\\commandName1", "This is first command"),
    (
        "\\commandName2",
        "This is second command with {} brackets inside\nin multiple lines {} {}\n",
    ),
    ("\\commandName3", "This is third, last command"),
]
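
The value group in the pattern above, `(?:(?!^\\newcommand).)+`, consumes one character at a time, but only while a line-start `\newcommand` does not begin at the current position. The sketch below is an illustration on a shortened input, not part of the original answer: the string `s` and the names `token`, `run` and `out` are assumptions made for the example. It first runs that sub-pattern on its own, then the full regex, where the final `\}` forces the engine to backtrack to the last `}` before the next command.

```python
import re

# Shortened, hypothetical input with the same shape as the sample.
s = r"""\newcommand{\a}{x {} y
}
\newcommand{\b}{z}
"""

# The sub-pattern on its own: starting inside the first body, it runs
# up to (but not into) the next \newcommand that starts a line.
token = re.compile(r"(?:(?!^\\newcommand).)+", flags=re.S | re.M)
run = token.match(s, s.index("x")).group()
print(repr(run))

# The full pattern from the answer: the trailing \} makes the greedy
# run give back characters until it ends just before the last }.
out = re.findall(
    r"^\\newcommand\{(.*?)\}\{((?:(?!^\\newcommand).)+)\}", s, flags=re.S | re.M
)
print(out)
```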

answered Oct 16 '22 at 15:39

Andrej Kesely


Thank you. Could you please explain just the last regex, the one used to match the value of the command? – Jason Oct 16 '22 at 16:06
This will slow down considerably for longer text, since it performs a negative lookahead before matching every character. Even with the current sample text it takes almost double the steps of the regex in my answer – anubhava Oct 16 '22 at 16:52
@Jason It's called *Tempered Greedy Token*. Wiktor has good answer about this: https://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat – Andrej Kesely Oct 16 '22 at 16:58
Even that answer mentions that the *Tempered Greedy Token is resource consuming*; moreover, this problem doesn't even require it. – anubhava Oct 16 '22 at 17:18
