I have an HTML document with multiple commented-out PHP arrays, e.g.:

<!-- Array
(
[key] => 0
)
-->

Using PHP, I need to somehow parse the HTML for only these comments (there are other comments that will need to be ignored) and extract the contents. I've been trying to use preg_match_all but my regex skills aren't up to much. Could anyone point me in the right direction?

Any help is much appreciated!

Ben

3 Answers

2

How about using an HTML parser that allows you to access comments (for example, Simple HTML DOM) and then checking each comment for newlines using strpos?

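// Path to the Simple HTML DOM library file; adjust as needed
require_once 'simple_html_dom.php';

$html = str_get_html('...HTML HERE...');
// The 'comment' selector returns the HTML comment nodes
$comments = $html->find('comment');
foreach ($comments as $comment) {
    // Array dumps span several lines, so look for a newline
    if (strpos((string) $comment, "\n") !== false) {
        // process comment
    }
}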
Yacoby
  • Thanks - I wonder if there is a way to do something similar through DOMDocument? – Ben Apr 06 '10 at 13:26
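
Something similar does seem possible with PHP's built-in DOM extension. The following is only a minimal sketch, not a tested solution for the asker's document: it assumes the wanted comments start with the word "Array" once trimmed, and the `$source` sample and the `$arrayComments` name are made up for illustration. It selects every comment node with the XPath expression `//comment()` and keeps those that look like array dumps.

```php
<?php
// Hypothetical sample input; the real document would come from a file or string.
$source = <<<HTML
<html><body>
<!-- an ordinary comment -->
<!-- Array
(
[key] => 0
)
-->
</body></html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$arrayComments = [];
// comment() matches comment nodes anywhere in the document
foreach ($xpath->query('//comment()') as $node) {
    $text = trim($node->nodeValue);
    // Keep only comments whose text begins with "Array"
    if (strpos($text, 'Array') === 0) {
        $arrayComments[] = $text;
    }
}

print_r($arrayComments);
```

Because `nodeValue` holds only the text between the comment delimiters, the `<!--` and `-->` markers are already gone from each entry.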
2

Three facts come into play here

  1. there is no place in an HTML document where a literal "<!--" can show up and not mean a comment (everywhere else it would be escaped as "&lt;!--")
  2. you don't seem to want to change the document contents, only find bits in it (search-and-replace has a high probability of breaking the document, search alone has not)
  3. comments cannot be nested in HTML (contrary to normal HTML tags) - this makes all the difference

The above combination means that (lo and behold) regular expressions can be used to identify HTML comments.

Try this regex: <!-- Array([\s\S]*?)-->. Match group one will contain everything after "Array" up to the closing sequence of the comment.

You can apply further sanity checking to the found bits to make sure they are in fact what you are looking for.
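
As a rough illustration of the approach, here is a sketch using preg_match_all with the quantifier placed inside the capture group, so that group one holds the whole comment body rather than a single character. The sample `$html` string, the `$arrays` name, and the sanity check (the body must be wrapped in parentheses) are assumptions made for this example, and the key/value extraction only handles flat arrays.

```php
<?php
// Hypothetical sample input with one ordinary comment and one array dump.
$html = "<p>text</p>\n<!-- ignore me -->\n<!-- Array\n(\n[key] => 0\n)\n-->\n";

// Lazy match from "<!-- Array" to the first "-->"; group one is the body.
preg_match_all('/<!-- Array([\s\S]*?)-->/', $html, $matches);

$arrays = [];
foreach ($matches[1] as $body) {
    $body = trim($body);
    // Sanity check: a print_r dump is wrapped in parentheses.
    if ($body === '' || $body[0] !== '(' || substr($body, -1) !== ')') {
        continue;
    }
    // Pull out "[key] => value" lines (flat arrays only).
    preg_match_all('/^\s*\[(.+?)\] => (.*)$/m', $body, $pairs);
    $arrays[] = array_combine($pairs[1], array_map('trim', $pairs[2]));
}

print_r($arrays);
```

Since `$matches[1]` contains only the captured bodies, the comment delimiters are dropped without a separate replace step. All extracted values are strings; nested arrays would need a real parser for the print_r format.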

Tomalak
  • 2
    2. Incorrect: comments where you don't expect them, e.g. `alt="<!-- this is just alt text"`
    – Quentin Apr 06 '10 at 13:30
  • Just for clarity, the document I'm dealing with is XHTML 1.0 Strict – Ben Apr 06 '10 at 13:33
  • @David: Yes, that is the edge case (+1 the comment). My remark would be that it is bad, bad style to use unescaped pointy brackets *anywhere* in the document except for tags (and attribute values are the only place where the `<` is… erm… tolerated). But I admit it might happen somewhere, and of course you need to know if it can happen in your data. – Tomalak Apr 06 '10 at 13:45
  • Thank you - this is doing the trick. Sorry to be thick, but is there a way to remove the HTML comment tags and just include the contents? – Ben Apr 06 '10 at 13:45
  • @Ben: That you can match them means you can replace them, doesn't it? *Disclaimer: Even though this answer may look like the opposite, I highly dis-recommend the use of regex to process HTML. There are cases where regex may be an acceptable shortcut, but these are rare, far apart and spotting them is not trivial. Finding comments is one such case, but bear in mind that David Dorward's objection above is correct and needs consideration. Proceed at your own risk.* – Tomalak Apr 06 '10 at 13:57
  • The counter argument is that `alt=" – Quentin Apr 06 '10 at 14:08
  • @David: I disagree. **a)** `alt=""` would be more readable, too, but you can't do that either. **c)** attribute values are *data*, not markup. Their contents has to be fcking *escaped*, period. I wish people would stop being smartasses about where their data could go partially unescaped just because they think they didn't have to. XSS would not be a problem if people were more aware and strict about code and data separation. – Tomalak Apr 06 '10 at 14:16
  • (a) So what? They have special meaning, you can't avoid it (well, you can — so long as the next character is a space or another character in a list I don't have to hand). (b) You can, actually. (c) The SGML specification says otherwise. You might not like it, you might have designed it differently, but it doesn't change that fact. – Quentin Apr 06 '10 at 14:42
  • 1
    @David: **a)** HTML is not about avoiding awkward character sequences as much as possible because they hinder reading. Human consumption is not the primary function of HTML, correctly transporting markup *and* data to a user agent is. **b)** I'm not sure about this. You can because HTML parsers are lenient and forgiving. **c)** Show me the part in the spec that allows it. ;) Bottom line is - "it's possible" != "you can". Example: *It's possible* to do this in PHP: `preg_replace("/\d/", "", $s)`, but it's still *wrong* because it must be `preg_replace("/\\d/", "", $s)`. Correct escaping is key. – Tomalak Apr 06 '10 at 15:12
  • (a) SGML was designed (AFAIK) to be convenient to write. It has a lot of short cuts in it. (b) The validator is not "lenient" or "forgiving". (c) I would, but I don't have a copy of the SGML specification handy and I don't care enough to pay for one. – Quentin Apr 06 '10 at 15:49
  • @David: Pay for one? I was under the impression the SGML spec had to be free? Hm. Anyway. :-) **a)** seems you are right with `alt=" – Tomalak Apr 06 '10 at 16:00
-2

Don't parse HTML with regular expressions. Ever.

Williham Totland
  • Yes. I wouldn't go as far as to say ever. There are situations where it is easier and works just fine. It is like the people who say "never ever use goto" and then come up with the most convoluted method ever for breaking out of nested loops. – Yacoby Apr 06 '10 at 12:27
  • I can see the rationale for general use, but in this case I know the exact string I'm searching for... – Ben Apr 06 '10 at 13:31