1

I'm trying to write a simple lexer in Python using regular expressions, so I need a regular expression that matches a multiline comment:

/* first line.
the second line
The last line. */

By using this pattern:

pattern = r"/\*.*\*/" 

and compiling it with

regex = re.compile(pattern,re.DOTALL) 

it works.

Now, I don't want to use re.DOTALL, because it would also affect how single-quoted strings are matched. Is there a way to write this expression so that it works without re.DOTALL?

tobias_k
Germano Carella
  • Use a character class containing the dot and a newline character. – Malik Brahimi Mar 12 '15 at 21:49
  • You need to escape the asterisks. – Malik Brahimi Mar 12 '15 at 21:51
  • You probably want `r'/\*.*?\*/'`; note the `.*?` instead of `.*`. This will make the regular expression give you the shortest match possible instead of the longest match possible. Try it on inputs like `/* a */ b /* c */`... my guess is that you want two matches, instead of just one. – Dietrich Epp Mar 12 '15 at 22:17
  • Could you elaborate on what the problem is with _dotall_ and single-quoted strings? If you are trying to parse C-style comments, this isn't the way. –  Mar 13 '15 at 00:45

3 Answers

2

You can achieve the same effect without re.DOTALL by using a little trick: the character class [\s\S].

The idea behind [\s\S] is that it matches any character, whitespace or not, so it also matches newlines. You can then delimit what you want with an explicit pattern. For instance:

/\*        <--- Match /*
[\s\S]*?   <--- Match everything (ungreedy)
\*/        <--- Match */

You can use a regex like this:

/\*[\s\S]*?\*/

If you want to capture the content within the comment then you could do:

/\*([\s\S]*?)\*/
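
As a minimal sketch of how this could look in Python, the capturing version can be compiled without any flags. The names `source`, `comment_re`, and `comment_body` are illustrative only; the sample text is the comment from the question.

```python
import re

# Sample input taken from the question.
source = "/* first line.\nthe second line\nThe last line. */"

# [\s\S] matches any character, newlines included, so no re.DOTALL is needed.
comment_re = re.compile(r"/\*([\s\S]*?)\*/")

match = comment_re.search(source)
comment_body = match.group(1)  # text between /* and */
print(repr(comment_body))
```

Here `match.group(0)` is the whole comment, and `match.group(1)` is only the text between the delimiters.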

By the way, you are using a greedy regex, /\*.*\*/, which will match too much when the input contains more than one comment. For instance, if you have:

/* A */
/* B */

With re.DOTALL, your regex will wrongly match everything from the first /* to the last */ as a single comment. You have to add ? to make the quantifier ungreedy, like this:

/\*.*?\*/
     ^--- ungreedy
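
The difference can be checked with a short sketch, assuming the two-comment input above and re.DOTALL as in the question. The variable names are illustrative only.

```python
import re

# Two comments on separate lines, as in the example above.
text = "/* A */\n/* B */"

# Greedy: .* runs to the last */ in the input.
greedy = re.findall(r"/\*.*\*/", text, re.DOTALL)

# Ungreedy: .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", text, re.DOTALL)

print(greedy)  # a single match spanning both comments
print(lazy)    # two separate matches
```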
Federico Piazza
1

As an alternative to the re.XXX constants, you can use inline flags:

re.match(r'(?s)/\*.*?\*/', stuff)

From the docs:

(?iLmsux) (One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression.

I prefer inlines to re.XXX flags for two reasons: 1) expressions are self-contained and 2) no need to use compile or to append the flags param to every re. call.
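
A small sketch of the inline-flag approach follows. The second pattern goes beyond the quoted docs: it assumes Python 3.6 or later, where a scoped group such as `(?s:...)` applies the flag only inside that group, which may help when a comment pattern and a string pattern share one expression. The names `stuff`, `global_match`, `token_re`, and `tokens`, and the simplified string sub-pattern `'.*?'`, are illustrative assumptions.

```python
import re

stuff = "/* first line.\nthe second line\nThe last line. */ 'a string'"

# Global inline flag: (?s) sits at the start and applies to the whole pattern.
global_match = re.match(r"(?s)/\*.*?\*/", stuff)

# Scoped inline flag (Python 3.6+): DOTALL is active only inside (?s:...),
# so the dot in the string sub-pattern still stops at a newline.
token_re = re.compile(r"(?s:/\*.*?\*/)|'.*?'")
tokens = token_re.findall(stuff)

print(global_match.group(0))
print(tokens)
```

With the scoped form, a single-quoted string that spans a newline is not matched by the second alternative, while the comment alternative can still span lines.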

georg
0

If we want to enumerate all the possibilities, I'll also post my answer:

/\*(?:[\r\n]|[^\r\n])*\*/

However, it requires 147 steps to compute with your example, while Fede's /\*[\s\S]*\*/ only needs 12.

If we compare performance between the versions with capturing groups - /\*((?:[\r\n]|[^\r\n])*)\*/ and /\*([\s\S]*?)\*/, the ratio is already not that large: 151 vs. 97 steps.
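
As a rough check in Python, both capturing versions extract the same text from the example in the question. This sketch only compares the results; it does not reproduce the step counts above, which come from a regex tester. The variable names are illustrative.

```python
import re

# Sample input taken from the question.
sample = "/* first line.\nthe second line\nThe last line. */"

# Alternation: either a line break or any character that is not one.
alternation_re = re.compile(r"/\*((?:[\r\n]|[^\r\n])*)\*/")

# Character class: [\s\S] matches any character in a single step.
char_class_re = re.compile(r"/\*([\s\S]*?)\*/")

alt_body = alternation_re.search(sample).group(1)
cls_body = char_class_re.search(sample).group(1)
print(alt_body == cls_body)
```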

Wiktor Stribiżew
  • Using alternation here is not a good idea. Most regex engines don't really optimize the character class here, so it will create a choice point after every character matched, compared to the `(?s).` or `[\s\S]` method which don't create any choice point. – nhahtdh Mar 13 '15 at 09:16