Can I match multiline string in python without using re.DOTALL?

Question

I'm writing a simple lexer in Python using regular expressions, so I need a regular expression that matches a multiline comment:

/* first line.
the second line
The last line. */

By using this pattern:

pattern = r"/\*.*\*/"

and compiling it with

regex = re.compile(pattern,re.DOTALL)

it works.

Now, I don't want to use re.DOTALL, because the flag would then also apply to single-quoted strings. Is there a way to write this expression so that it works without re.DOTALL?

Use a character class containing the dot and a newline character. — Malik Brahimi, Mar 12 '15 at 21:49
You probably want `r'/\*.*?\*/'`; note the `.*?` instead of `.*`. This will make the regular expression give you the shortest match possible instead of the longest match possible. Try it on inputs like `/* a */ b /* c */`... my guess is that you want two matches, instead of just one. — Dietrich Epp, Mar 12 '15 at 22:17
Could you elaborate what is the problem with _dotall_ and single-quoted strings ? If you are trying to parse c-style comments, this isn't the way. — , Mar 13 '15 at 00:45
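
A note on the first comment above: inside a character class the dot is a literal dot, so a class such as [.\n] does not behave like "any character". The sketch below is only a quick check of that point on the sample comment from the question. The variable names are made up for illustration, and the alternation (?:.|\n) is one guess at what the commenter may have intended.

```python
import re

text = "/* first line.\nthe second line\nThe last line. */"

# Inside [...] the dot is literal: this class matches only "." or "\n",
# so the pattern fails as soon as it meets the space after "/*".
literal_dot = re.search(r"/\*[.\n]*\*/", text)

# An alternation outside a character class keeps the dot's special meaning.
dot_or_newline = re.search(r"/\*(?:.|\n)*?\*/", text)

# A class that really does match any character.
any_char = re.search(r"/\*[\s\S]*?\*/", text)

print(literal_dot)                    # None
print(dot_or_newline.group() == text)
print(any_char.group() == text)
```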

Federico Piazza · Accepted Answer · 2015-03-12T22:09:24.270

You can achieve the same result with a little trick: the character class [\s\S].

The idea behind [\s\S] is to match any character at all (whitespace or non-whitespace, so newlines are included), which lets you delimit what you want using an explicit pattern. For instance:

/\*        <--- Match /*
[\s\S]*?   <--- Match everything (ungreedy)
\*/        <--- Match */

You can use a regex like this:

/\*[\s\S]*?\*/

If you want to capture the content within the comment then you could do:

/\*([\s\S]*?)\*/
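
As a minimal sketch of how the capturing version could be used from Python, assuming a small made-up input (the names COMMENT, source and comments are illustrative, not from the question):

```python
import re

# Same pattern as above; no re.DOTALL is needed because [\s\S]
# matches any character, including newlines.
COMMENT = re.compile(r"/\*([\s\S]*?)\*/")

source = """int x; /* first line.
the second line
The last line. */ int y; /* short */"""

# group(1) is the text between the comment delimiters.
comments = [m.group(1) for m in COMMENT.finditer(source)]
print(comments)
```

Because the quantifier is ungreedy, each comment is returned separately rather than as one long match.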

By the way, you are using a greedy regex, /\*.*\*/, which will match too much when the input contains more than one comment. For instance, if you have:

/* A */
/* B */

With re.DOTALL, your regex will wrongly match both comments as a single one, from the first /* to the last */. You have to add ? to make the quantifier ungreedy, like this:

/\*.*?\*/
     ^--- ungreedy
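
The difference can be checked with a short sketch like the following, which uses the two-comment input above plus the single-line input from Dietrich Epp's comment. The variable names are illustrative only.

```python
import re

two_lines = "/* A */\n/* B */"
one_line = "/* a */ b /* c */"

# Greedy with DOTALL: runs from the first /* to the last */.
greedy_dotall = re.findall(r"/\*.*\*/", two_lines, re.DOTALL)

# Ungreedy with DOTALL: stops at the first */ after each /*.
lazy_dotall = re.findall(r"/\*.*?\*/", two_lines, re.DOTALL)

# The same problem appears without DOTALL when both comments share a line.
greedy_one_line = re.findall(r"/\*.*\*/", one_line)
lazy_one_line = re.findall(r"/\*.*?\*/", one_line)

print(greedy_dotall)    # one match spanning both comments
print(lazy_dotall)      # two separate matches
print(greedy_one_line)  # one match including " b "
print(lazy_one_line)    # two separate matches
```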

georg · Answer 2 · 2015-03-13T07:24:15.823

As an alternative to the re.XXX constants, you can use inline flags:

re.match(r'(?s)/\*.*?\*/', stuff)

From the docs:

(?iLmsux) (One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression.

I prefer inline flags to the re.XXX constants for two reasons: 1) the expressions are self-contained, and 2) there is no need to use compile or to append the flags parameter to every re.* call.
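
A hedged sketch of both inline forms follows. The global (?s) applies to the whole pattern, while the scoped form (?s:...), available since Python 3.6, limits DOTALL to one group, which may suit a lexer that combines several rules. The TOKEN pattern, its STRING rule and the sample input are hypothetical and not taken from the question.

```python
import re

stuff = "/* first line.\nthe second line\nThe last line. */"

# Global inline flag: applies to the whole pattern and has to sit at the
# very start of the expression (Python 3.11+ rejects it elsewhere).
whole = re.match(r"(?s)/\*.*?\*/", stuff)

# Scoped inline flag: DOTALL is active only inside (?s:...).
# STRING is a hypothetical single-quoted-string rule for illustration.
TOKEN = re.compile(r"(?P<COMMENT>/\*(?s:.*?)\*/)|(?P<STRING>'.*?')")

sample = "'a' /* x\ny */ 'b\nc'"
kinds = [(m.lastgroup, m.group()) for m in TOKEN.finditer(sample)]
print(kinds)
```

In this sketch the comment rule crosses the newline, while the string rule still refuses to, so the last quoted piece of the sample is not matched as a string.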

edited Mar 13 '15 at 07:24

answered Mar 12 '15 at 21:53

You don't have to use `compile` either way, you can pass flags to e.g. `re.match()`. – Dietrich Epp Mar 12 '15 at 22:14
@DietrichEpp: thanks, I forgot about that. Never used flags this way. – georg Mar 13 '15 at 07:24

score 0 · Answer 3 · answered Mar 12 '15 at 22:01

If we want to enumerate all the possibilities, I'll also post my answer:

/\*(?:[\r\n]|[^\r\n])*\*/

However, it requires 147 steps to compute with your example, while Fede's /\*[\s\S]*\*/ only needs 12.

If we compare the versions with capturing groups, /\*((?:[\r\n]|[^\r\n])*)\*/ and /\*([\s\S]*?)\*/, the gap is no longer as large: 151 vs. 97 steps.
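
The step counts above come from a regex debugger, and Python's re module does not report steps. As a rough, non-authoritative stand-in, a sketch like the one below checks that the three approaches match the same text and times them with timeit. The dictionary keys and the iteration count are arbitrary choices, and the timings will vary by machine and Python version.

```python
import re
import timeit

text = "/* first line.\nthe second line\nThe last line. */"

patterns = {
    "alternation": r"/\*(?:[\r\n]|[^\r\n])*\*/",
    "char class": r"/\*[\s\S]*\*/",
    "inline (?s)": r"(?s)/\*.*\*/",
}
compiled = {name: re.compile(p) for name, p in patterns.items()}

# All three should match exactly the same span on this input.
matches = {name: rx.match(text).group() for name, rx in compiled.items()}

# Wall-clock timing only; this is not the debugger's "steps" metric.
for name, rx in compiled.items():
    seconds = timeit.timeit(lambda: rx.match(text), number=20_000)
    print(f"{name:12s} {seconds:.4f}s")
```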

answered Mar 12 '15 at 22:01

Wiktor Stribiżew

Using alternation here is not a good idea. Most regex engines don't really optimize the character class here, so it will create a choice point after every character matched, compared to the `(?s).` or `[\s\S]` method which don't create any choice point. – nhahtdh Mar 13 '15 at 09:16
