Removing Block Comments via Regex

Question

I am having difficulty trying to get this regex to work. All I am trying to do is remove block comments. This is what I have so far but I can't get rid of the final */.

$string = 'this is a test /*asdfa  */ ok then';

$pattern = '/\/\*([^\*\/]*)/i';

$replacement = '';

echo preg_replace($pattern, $replacement, $string);

//this is a test */ ok then

Any help will be appreciated.

PHP isn't a regular language, so it's impossible parse it or to remove all valid block comments with a regex. — Paul Tomblin, Nov 17 '10 at 17:12
@Paul: You cannot *parse* the PHP with regex, but you can lexically analyze it just fine. One doesn't need a full blown parser to get rid of comments (indeed, usually the comments are thrown out in lexical analysis, not parsing) — Billy ONeal, Nov 17 '10 at 17:15
@Billy No you can't. http://en.wikipedia.org/wiki/Chomsky_hierarchy — Alin Purcaru, Nov 17 '10 at 17:16
@Alin: Why not? Can you show an example that requires fully parsing? You are correct in that you cannot **parse** the language with regex. But you can **lexically analyze** it with regex just fine. — Billy ONeal, Nov 17 '10 at 17:16
I'm the first to discourage regexes for tasks they can't handle. But C-style comment **can** be recognized (and e.g. stripped) by regular expressions, because they cannot nest (this is the same in PHP and quite a few others). `/* a /* b */ echo 'see?'; */` will out "see?" (or rather, the parser rejects it because of the final `*/`, which still proves the point. The SO syntax highlighter gets this right btw. — , Nov 17 '10 at 17:19
See this example `" this is a comment /* or is it? */"`. Should the comment inside the string be removed? Want to make it more complicated? Bring in some heredoc. — Alin Purcaru, Nov 17 '10 at 17:21
@Alin: Okay, but that still doesn't require parsing the language. Syntax highlighters work based on lexical analysis (i.e. regex), and they get the coloring of my comment blocks down just fine. Yes, you would need to account for the comment existing inside a string, but that's the only place I can really think of it. — Billy ONeal, Nov 17 '10 at 17:24
@Alin: I agree with you in that a regex shouldn't be used for this task. However, it is extremely possible, as anyone who's ever written a C compiler on top of `lex` or `flex` can tell you. (It is, admittedly, extremely difficult with the `preg_xxx` functions) — Billy ONeal, Nov 17 '10 at 17:26
@Billy Why do you assume syntax highlighters work with regexes? Why not tokens? As for removing the comments with regex I will consider that it may be possible (but just for comments or other limited subsets of the language). If you can I would like to see some examples of this done. — Alin Purcaru, Nov 17 '10 at 17:26
@Alin: Tokens are typically generated using regular expressions. Syntax highlighters *do* generally do work with tokens. However, the tokens are typically generated using some form of regular expressions. Case in point, `token_get_all`, as indicated in your answer, is probably implemented using regular expressions. You **do** need things more powerful than regular expressions when you move on to things like parsing, or making sense of that stream of tokens. Syntax highlighters don't need to make sense of the tokens, so they're fine with regex only. — Billy ONeal, Nov 17 '10 at 17:30
If you just want to remove all comments (and whitespaces) do `php -w filename.php` on the CLI — Gordon, Nov 17 '10 at 17:32
or use http://de3.php.net/manual/en/function.php-strip-whitespace.php — Gordon, Nov 17 '10 at 17:42
@Alin: Case in point, looking at the PHP source code, the tokenizer is written using `lex`, which is regular expressions ;) (You want the file `$/php-5.3.3/Zend/zend_language_scanner.l`) (Where $ is the root of the source tarball) — Billy ONeal, Nov 17 '10 at 18:00
@Billy regexes used within a programming language qualifies as a programming language and not as regex. — Alin Purcaru, Nov 17 '10 at 18:07
@Alin: No, actually, because `lex`/`flex` generated scanners do not allow you to control the scanner's behavior with code, only with the regular expressions. `lex` and `flex` generated scanners are incapable of recognizing anything but regular languages. Fortunately, most any programming language is tokenizable with regular languages (even the complicated ones like C++). — Billy ONeal, Nov 17 '10 at 18:10
@Alin, that is irrelevant. Next to nobody uses patterns that are constrained to regular languages any more, nor have they for about thirty years. Good morning, Rip Van Winkle! — tchrist, Nov 17 '10 at 23:54
@tchrist: Actually, the preg_xxx functions *are* constrained to what a regular language can recognize. (The preg_xxx functions do not support recursive regexen, though there are parsers which *do* support such features, i.e. [`boost::xpressive`](http://www.boost.org/doc/libs/1_44_0/doc/html/xpressive/user_s_guide.html#boost_xpressive.user_s_guide.grammars_and_nested_matches)) — Billy ONeal, Nov 18 '10 at 01:14
@Billy, how weird! Doesn't seem very perlish then. Even PCRE supports such things. Where's the "PC" part gone? — tchrist, Nov 18 '10 at 01:23
@tchrist: The only PCRE feature of which I am aware which is non-regular is callouts. Perl's is much more complicated because it lets you embed normal language bits inside of the regex, but that's uncommon. (Even things like anchors and lookahead/lookbehind are still regular, despite the fact that building a DFA for them would be insanely complicated in practice) — Billy ONeal, Nov 18 '10 at 03:33
@Billy: Like [this](http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386). Cool, eh? — tchrist, Nov 18 '10 at 14:58
@tchrist: Yes. Though I hope I never have to write or use anything like that in practice :P If PCRE supports it, than the `preg_xxx` functions should support it too, considering they're built on top of PCRE. — Billy ONeal, Nov 18 '10 at 15:50
@Billy: I have never managed to figure out which PCRE version any given PHP implementation is linked with. I have been counting PHP as regex-rich because of this, but I am not 100% certain. Very very old PCRE didn’t support, but it’s been in there for a long long time. — tchrist, Nov 18 '10 at 15:57

score 6 · Answer 1 · edited Nov 17 '10 at 17:15

6

Try this as your pattern:

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

edited Nov 17 '10 at 17:15

Billy ONeal

104,103
58
317
552

answered Nov 17 '10 at 17:10

wajiw

12,239
17
54
73

I don't understand... what's \r\n doing in there? – Billy ONeal Nov 17 '10 at 17:16
It removes new lines. May not be needed for the example he gave, but that's what I use to remove multi-line comments. – wajiw Nov 17 '10 at 17:19
@wajiw: Why not just make it match *any character*? I don't see how newlines are special. – Billy ONeal Nov 17 '10 at 17:21
I've run into the problem where .* doesn't match new lines – wajiw Nov 17 '10 at 17:24
@wajiw: Ahh.. I forgot about that. Use the `m` modifier to make `.` match newlines. – Billy ONeal Nov 17 '10 at 17:28
1

Yeah that's a smarter way of doing it :-) Thanks – wajiw Nov 17 '10 at 17:32

Billy ONeal · Accepted Answer · 2010-11-17T18:00:06.653

6

Use a different delimiter than / -- it makes it confusing.

How about '#/\*.+?\*/#s';

edited Nov 17 '10 at 18:00

answered Nov 17 '10 at 17:13

Billy ONeal

104,103
58
317
552

1

Should there be an `m` modifier in there though? (disclaimer I'm a little rusty on multiline regexes here.) – BoltClock Nov 17 '10 at 17:28
You actually need `s` (DOT_ALL), not `m`. – Alin Purcaru Nov 17 '10 at 17:56
@Billy - thank you. This is one I can understand and actually works for cases when you want to get rid of `/*** a */` - some of there answers didn't work for this case. In addition, I think its much simpler to use this than the tokenizer. – Abs Nov 18 '10 at 14:46
1

@Abs: 1. Thank you, and 2. To be fair, the tokenizer is going to be more accurate. (But then, you already knew that ;) ) – Billy ONeal Nov 18 '10 at 14:47

score 6 · Answer 3 · answered Nov 17 '10 at 17:13

6

token_get_all and build it back without T_COMMENTs. I don't think anything more should be said.

answered Nov 17 '10 at 17:13

Alin Purcaru

43,655
12
77
90

Everybody loves tokenizers. +1 also for the second sentence. – BoltClock Nov 17 '10 at 17:16
Overkill solution is overkill. Regular expressions are sufficient and more appropriate for *this* use case. (not that the tokenizer wouldn't use them internally as well) – mario Nov 17 '10 at 17:26
@mario Do you know how the tokenizer works? RegEx is not magic... so don't assume you can use them anywhere. – Alin Purcaru Nov 17 '10 at 17:31
@Alin Purcaru: I made no assertions to *anywhere*, but specifically said "this use case". And yes, I do know a few things about them, and I've also written a few tokenizers. – mario Nov 17 '10 at 17:35
@mario Using some regular expressions under certain conditions to find the tokens is different from using *a* regular expression. – Alin Purcaru Nov 17 '10 at 17:39
Likewise is using the php tokenizer sensible under certain conditions (e.g. writing a parser) not for all of them (e.g. looking for comments). I'm not against recommending the pro solution, but there should be some rationalization for its use. Else it reads like mindless parroting the *regex are evil* meme. – mario Nov 17 '10 at 17:47
Regexes are not evil, but rather deceitful. And if you want to use them to find a pattern inside code then you should think at least 3 times before doing it. Building a regex that will find comments, and only comments, like the question asker wanted is, if not impossible, very hard. – Alin Purcaru Nov 17 '10 at 17:53
@Alin you are right about token_get_all being the most professional solution. You are wrong about making too much assumptions about the edge cases. Any of the regexes here will fail given enough PHP code. However it's highly unlikely to fail the asker. What I'm onto, like exactness and security concerns are in favor of the real thing, the use cases and maybe processing incomplete PHP code are in favor of the simpler solutions. We should list pros and cons objectively. (But I'd also say that e.g. speed perceivements of one or the other solution should not be the deciding factor. This time.) – mario Nov 17 '10 at 18:15

score 1 · Answer 4 · answered Nov 17 '10 at 17:30

1

I'm using this (note that you only need the first line for /*...*/ comments):

  #-- extract /* ... */ comment block
  #  or lines of #... #... and //... //...
  if (preg_match("_^\s*/\*+(.+?)\*+/_s", $src, $uu)
  or (preg_match("_^\s*((^\s*(#+|//+)\s*.+?$\n)+)_ms", $src, $uu))) {
     $src = $uu[1];
  }
  // Public Domain, not CC

Works quite well. But like all regex solutions, it would fail on the $PHP = "st/*rings" edge case.

answered Nov 17 '10 at 17:30

mario

144,265
20
237
291

Do not say "like all regex solutions", because you are wrong. Say only that those presented here are too primitive for the feat. – tchrist Nov 17 '10 at 23:55
1

@tchrist: My bad. In trying to righten the regex slander, I've overgeneralized it myself. – mario Nov 17 '10 at 23:59

score 0 · Answer 5 · answered Nov 17 '10 at 17:15

0

Running preg_replace twice with pattern /\*|\*/ should work.

answered Nov 17 '10 at 17:15

Daniel F. Thornton

3,687
2
28
41

2

That only gets rid of the comment delimiters, not the comment text itself. – Billy ONeal Nov 17 '10 at 17:17

score 0 · Answer 6 · edited Nov 18 '10 at 01:58

0

Maybe:

$pattern = '/\/\*([.]*)\*\//i';

Please don't down-rate as this is a quick guess trying to help... :)

edited Nov 18 '10 at 01:58

Alan Moore

73,866
12
100
156

answered Nov 17 '10 at 17:16

Brian H

833
1
5
12

score 0 · Answer 7 · edited Nov 17 '10 at 17:26

0

To just fix your main pattern, I can tell you that your not matching the final "*/" because you are missing it from your pattern.

Following your own pattern, try this little modification:

'/\/\*([^\*\/]*)**\*\/**/i'

I also suggest you to use different delimitators to make the pattern more read-friendly.

#/\*([^\*/]*)\*/#i

edited Nov 17 '10 at 17:26

Billy ONeal

104,103
58
317
552

answered Nov 17 '10 at 17:24

denica

221
1
3
12

Removing Block Comments via Regex

7 Answers7