I'm trying to find comment blocks in PHP source code using regular expressions in Python 3. The PHP comments are in this format:

/**
 * This is a very short block comment
 */

Now I came up with the following regular expression:

'/\*\*[.]+?\*/'

I figured that, in combination with the DOTALL flag, this should do it, but no. It doesn't find anything. The strange thing is that when I remove the trailing slash, like this:

'/\*\*[.]+?\*'

then it finds the following string:

/**\n\t*

I have no idea why the regex can't find an asterisk followed by a slash... I checked the file I'm searching to make sure there was no typo in the comment (there wasn't). Also, a slash is not a special character in regex, so I shouldn't have to escape it. (I tried, but it didn't help.)

Can anyone tell me what's wrong with my regex? :)

By the way, I also came across this thread where someone tried to do the same in Java. The winning answer ended its regular expression the same way I do now, so I'm clueless :( Could this be a bug in Python regex or am I completely missing something?

Any help is much appreciated! :D

lunanoko
  • Why do you have `[.]` in your pattern ? As opposed to just .+ – arunkumar Aug 16 '11 at 16:59
  • Well, because at first I used [.\s] without the DOTALL flag. After I removed the \s and added the DOTALL flag, the square brackets just kept lingering there. However, now that I have removed them, it seems they were causing the problem. Would anyone care to explain that? As far as my regex knowledge goes, '.+' should match the same things as [.]+, right? – lunanoko Aug 16 '11 at 17:43

2 Answers

You can use the re.DOTALL flag to make the . character match newlines:

re.compile(r'/\*\*.+?\*/', re.DOTALL)

(As a side note, PHP block comments can start with /*, not just /**.)
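A minimal sketch of the difference between the pattern above and the bracketed variant from the question. The `php_source` string and the variable names are made up for illustration. Inside a character class, `.` loses its special meaning and stands for a literal period, so `[.]+?` can only consume periods:

```python
import re

# Hypothetical PHP snippet used only for this demonstration.
php_source = """<?php
/**
 * This is a very short block comment
 */
function f() {}
"""

# Inside a character class, "." is a literal period, so this requires one or
# more periods between "/**" and "*/" and finds nothing in the snippet.
bracketed = re.compile(r'/\*\*[.]+?\*/', re.DOTALL)

# Outside a class, "." matches any character; DOTALL extends that to newlines.
unbracketed = re.compile(r'/\*\*.+?\*/', re.DOTALL)

bracketed_matches = bracketed.findall(php_source)
unbracketed_matches = unbracketed.findall(php_source)

print(bracketed_matches)    # no matches
print(unbracketed_matches)  # the whole comment block
```

The bracketed pattern would still match an input such as `/**...*/`, where only periods sit between the delimiters.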

jtbandes
  • My bad, I forgot to include the DOTALL flag in my post. I did use it in my program though, and it doesn't work. The problem seems to be that the last / is not recognized for some reason :( – lunanoko Aug 16 '11 at 17:31
  • Just now I noticed you don't use []'s around the . in your expression. When I remove the []'s in my program the regex works! Could you (or someone else) explain why it works without brackets but doesn't work with them? Thanks for your answer anyway! It works now! :) – lunanoko Aug 16 '11 at 17:38
  • Ah, perhaps that was your problem, `[.]` will match a literal period whereas `.` will match any character. – jtbandes Aug 16 '11 at 17:43
  • Yeah, I was wondering why they didn't match the same thing, but now that I think about it, why would you ever want a . in between []'s... it makes no sense! Thanks for your time and help! :) – lunanoko Aug 16 '11 at 17:51
  • This will not work correctly in PHP because you can have comment characters inside of quoted text. – Richard Mar 09 '15 at 19:57

Try this:

r'\/\*\*[^*]*\*+([^/][^*]*\*+)*\/'

(this is the regex used by some CSS parsers for /* CSS comments */, so I believe it is pretty solid)

It won't enforce the exact format (the line breaks and the inner asterisks), but you can work around that. This will match:

/**
 * This is a very short block comment
 */

But also:

/** This is a very short block comment */

And even:

/** This is a very short block comment 
*/

To match the exact format of docblocks, you'd need a real parser, not regular expressions.
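As a quick check, here is a small sketch that runs the expression above against the three example comments. The names `css_style`, `samples`, and `find_comments` are made up for illustration. Because the pattern relies on negated character classes, which match newlines, it does not need the DOTALL flag:

```python
import re

# The expression from this answer, unchanged.
css_style = re.compile(r'\/\*\*[^*]*\*+([^/][^*]*\*+)*\/')

# The three comment layouts shown above.
samples = [
    "/**\n * This is a very short block comment\n */",
    "/** This is a very short block comment */",
    "/** This is a very short block comment \n*/",
]

# fullmatch checks that each sample is matched from start to end.
results = [css_style.fullmatch(s) is not None for s in samples]
print(results)


def find_comments(source):
    """Return every /** ... */ block found in source (hypothetical helper)."""
    # The pattern contains a capturing group, so findall() would return the
    # group's text; finditer() with group(0) returns the whole match instead.
    return [m.group(0) for m in css_style.finditer(source)]


print(find_comments("a /** x */ b /** y\n */"))
```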

moraes
  • Thank you for your reply. Your expression works, though like you said, it needs some tweaking to work according to my needs :) Going with jtbandes' solution, though, because his does exactly what I want at the moment :) Thanks! – lunanoko Aug 16 '11 at 17:42
  • Both do the same thing. His is simpler; I just copy & pasted from something I had. – moraes Aug 16 '11 at 18:40
  • This could not work in PHP because it does not account for comment characters which may appear in quoted text. Therefore, it will extract things which are not comments. – Richard Mar 09 '15 at 19:56
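
The caveat about quoted text can be reproduced with a short sketch. The `php_source` string is a made-up example, and the pattern is the DOTALL one from the first answer. A regex has no notion of PHP string literals, so comment-like text inside quotes is reported alongside the real comment:

```python
import re

pattern = re.compile(r'/\*\*.+?\*/', re.DOTALL)

# Hypothetical PHP snippet: the first "/** ... */" sits inside a string
# literal and is not a comment; the second one is a real comment.
php_source = """<?php
$s = "/** not a comment */";
/** a real comment */
"""

found = pattern.findall(php_source)
print(found)  # both blocks are returned, including the one inside the string
```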