0

Been beating my head against a wall trying to get this to work - help from any regex gurus would be greatly appreciated!

The text that has to be matched

[template option="whatever"] 

<p>any amount of html would go here</p>

[/template]

I need to pull the 'option' value (i.e. 'whatever') and the html between the template tags.

So far I have:

> /\[template\s*option=["\']([^"\']+)["\']\]((?!\[\/template\]))/

Which gets me everything except the html between the template tags.

Any ideas?

Thanks, Chris

Chris
  • 1
    Which language are you using? – aqua Jan 23 '11 at 03:53
  • 3
    What happens if `this is how you break a parser: [/template] It's broken now!` is the html? – ircmaxell Jan 23 '11 at 03:53
  • aqua: PHP ircmaxell: doesn't matter – Chris Jan 23 '11 at 03:55
  • Please post the language you're using – AlanFoster Jan 23 '11 at 03:57
  • @user551841 he did, it's PHP ... @Chris wow I thought I knew regexes but I don't get the middle part `"\'["\']]((?!` at all! Or is the PHP syntax that special? – Felix Dombek Jan 23 '11 at 04:02
  • 1
    I suspect that you forgot to escape brackets. Remember - they have special meaning in regex? – ulidtko Jan 23 '11 at 04:05
  • obligatory link to reasons not to use regular expressions to parse a non-regular language: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Mark Elliot Jan 23 '11 at 04:05
  • Oh yes, answering such a question just cost me 2 rep from people who think so. If someone has an answer, I recommend to just put it in a comment – Felix Dombek Jan 23 '11 at 04:08
  • ah.. I was using a blockquote instead of pre to display the regex and it was removing some of the characters - sorry! – Chris Jan 23 '11 at 04:11
  • Mark E: I'm not really trying to parse the html - just to identify it! The content between the [template][/template] tag could be anything... – Chris Jan 23 '11 at 04:14
  • 1
    well, your second parenthesis group includes the `[/template]` tag, but otherwise you should be able to access the contents of the parens by number! For the HTML, you can simply try a "reluctant" `.*` (probably `.*?` but I'm not familiar with PHP). Also be aware, of course, that your `option` value should not be empty or contain escaped `"` chars, otherwise this will not work ... – Felix Dombek Jan 23 '11 at 04:16
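For what it's worth, the negative lookahead in the question can be made to capture the html by repeating it one character at a time. The following is only a sketch under that assumption, not necessarily what the asker ended up using; the sample text and variable names are illustrative.

```php
<?php
// Sample input modelled on the question; names here are illustrative only.
$text = "[template option=\"whatever\"]\n\n<p>any amount of html would go here</p>\n\n[/template]";

// ((?:(?!\[\/template\]).)*) consumes one character at a time, as long as
// the closing tag does not start at the current position.
// The s modifier lets "." match newlines as well.
$pattern = '/\[template\s*option=["\']([^"\']+)["\']\]((?:(?!\[\/template\]).)*)\[\/template\]/s';

$option = null;
$html = null;

if (preg_match($pattern, $text, $m)) {
    $option = $m[1];       // value of the option attribute
    $html = trim($m[2]);   // content between the tags, outer whitespace removed
}

echo "option: $option\n";
echo "html: $html\n";
```

If the html itself contains a literal [/template], the match ends at that first occurrence, which is the situation raised in the comments above.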

4 Answers

1

Try this

/\[template\s*option=\"(.*)\"\](.*)\[\/template]/

Basically, instead of using a complex regex to match every single thing, just use (.*), which matches anything. You want everything in between the tags, and you don't need to validate that data.

bhappy
  • Yes, I tried this, but it doesn't work, I presume because '.' matches any character except a newline, which the content may have. Replacing (.*) with ([.\n]) didn't work either. – Chris Jan 23 '11 at 04:22
  • @Chris, PHP's regex has [pattern modifiers](http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php); to make `.` match newlines you'd follow the expression with an `s` (the `m` modifier only changes how `^` and `$` anchor). – Mark Elliot Jan 23 '11 at 04:32
  • @Mark E, great thanks for the tip and that brilliant answer to parsing html with regex is going on my wall to cheer me up first thing on a monday morning! – Chris Jan 23 '11 at 04:41
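A quick way to check which modifier makes `.` cross line breaks. In PCRE, `s` (dot-all) does that, while `m` only affects the `^` and `$` anchors. The sketch below uses a lazy variant of the pattern from this answer; the sample text and variable names are made up for illustration.

```php
<?php
// Multi-line sample input (illustrative).
$text = "[template option=\"whatever\"]\n<p>line one</p>\n<p>line two</p>\n[/template]";

// Pattern body without delimiters, so different modifiers can be appended.
$base = '\[template\s*option="(.*?)"\](.*?)\[\/template\]';

$withoutFlag = preg_match('/' . $base . '/', $text, $a);   // "." stops at newlines
$withM       = preg_match('/' . $base . '/m', $text, $b);  // m only changes ^ and $
$withS       = preg_match('/' . $base . '/s', $text, $c);  // s lets "." match newlines

echo "no modifier: $withoutFlag\n"; // 0
echo "m modifier:  $withM\n";       // 0
echo "s modifier:  $withS\n";       // 1
```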
1

Edit: [\s\S] matches any character, whitespace or not, so it also matches newlines.

You may have a problem when there are consecutive blocks in a large string. In that case you will need a more specific quantifier: either make it non-greedy (+?), give it a bounded range such as {1,200}, or use something more specific than [\s\S].

/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+)\[\/template\]/
amcashcow
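To illustrate the point about consecutive blocks, here is a small sketch comparing the greedy [\s\S]+ with the non-greedy [\s\S]+? on a string holding two blocks. The input and variable names are assumptions made for the example.

```php
<?php
// Two consecutive template blocks in one string (illustrative input).
$text = "[template option='a']<p>first</p>[/template]\n"
      . "[template option='b']<p>second</p>[/template]";

$greedy = '/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+)\[\/template\]/';
$lazy   = '/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+?)\[\/template\]/';

// Greedy: the content group runs on to the LAST [/template] in the string.
$greedyCount = preg_match_all($greedy, $text, $g, PREG_SET_ORDER);

// Non-greedy: each match stops at the first [/template] that follows it.
$lazyCount = preg_match_all($lazy, $text, $l, PREG_SET_ORDER);

echo "greedy matches: $greedyCount\n"; // 1
echo "lazy matches:   $lazyCount\n";   // 2
foreach ($l as $match) {
    echo $match[1] . ' => ' . $match[2] . "\n";
}
```

With the greedy version both blocks are swallowed into a single match, so the second option value is never captured on its own.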
0

The (?!...) assertion is unneeded. Just match with .*? to get the minimum giblets, and add the s modifier so that . also matches newlines.

/\[template\s*option=\pP([\h\w]+)\pP\]  (.*?)  \[\/template\]/xs
mario
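The x modifier used above ignores whitespace in the pattern and allows # comments, which makes it possible to spell the expression out. The sketch below is a variation on that idea, not the answer's exact pattern: it swaps \pP for an explicit quote class, uses ~ as the delimiter so the slash needs no escaping, and the group names (option, html) are made up for the example.

```php
<?php
// Illustrative multi-line input.
$text = "[template option=\"whatever\"]\n<p>line one</p>\n<p>line two</p>\n[/template]";

// x: whitespace in the pattern is ignored and "#" starts a comment.
// s: "." also matches newlines.
$pattern = '~
    \[template \s* option=    # opening tag up to the attribute
    ["\']                     # opening quote, single or double
    (?<option> [^"\']+ )      # attribute value
    ["\']                     # closing quote
    \]
    (?<html> .*? )            # shortest run of content, newlines included
    \[/template\]             # closing tag
~xs';

$option = null;
$html = null;

if (preg_match($pattern, $text, $m)) {
    $option = $m['option'];
    $html = trim($m['html']);
}

echo "option: $option\n";
echo "html: $html\n";
```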
0

Chris,

I see you've already accepted an answer. Great!

However, I don't think regular expressions are the right solution here. I think you can get the same effect with string manipulation (substrings, etc.).

Here is some code that may help you. If not now, maybe later in your coding endeavors.

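<?php

    $string = '[template option="whatever"]<p>any amount of html would go here</p>[/template]';

    // Everything from 'option=' onward, then skip 'option="' (8 characters).
    $extractoptionline = strstr($string, 'option=');
    $chopoff = substr($extractoptionline, 8);
    // The option value runs up to the '"]' that closes the opening tag.
    $option = substr($chopoff, 0, strpos($chopoff, '"]'));

    echo "option: $option<br />\n";

    // Everything from '"]' onward, then skip those 2 characters.
    $extracthtmlpart = strstr($string, '"]');
    $chopoffneedle = substr($extracthtmlpart, 2);
    // The html runs up to the '[/' that starts the closing tag.
    $html = substr($chopoffneedle, 0, strpos($chopoffneedle, '[/'));

    echo "html: $html<br />\n";

?>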

Hope this helps anyone looking for a similar answer with a different flavor.

aqua
  • Can I ask why you don't think regular expressions are the right solution here? What is the disadvantage of using a regular expression? – Chris Jan 23 '11 at 04:52
  • @Chris: For your purposes, and because you have a valid solution now, you can use regex. However, in general, I use regular expressions when I want to find some text in a document whose formatting I cannot control (or do not know). If the format of the document is more or less statically known, or has a particular structure, you can use string manipulation functions like I did here. Notice that the functions implicitly do the same thing as regex (find `[/` etc...). – aqua Jan 23 '11 at 05:46
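For anyone taking the string-function route, a slightly more defensive sketch is shown below. It is an illustration only: the helper name extractTemplate is hypothetical, it assumes double quotes around the option value, and it returns null when a marker is missing instead of carrying on with a false offset.

```php
<?php
// Hypothetical helper: returns ['option' => ..., 'html' => ...] or null
// when any of the expected markers is missing.
function extractTemplate(string $s): ?array
{
    $open = 'option="';
    $start = strpos($s, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);              // first character of the value

    $endOfTag = strpos($s, '"]', $start); // end of the opening tag
    if ($endOfTag === false) {
        return null;
    }

    $close = strpos($s, '[/template]', $endOfTag); // closing tag
    if ($close === false) {
        return null;
    }

    return [
        'option' => substr($s, $start, $endOfTag - $start),
        'html'   => trim(substr($s, $endOfTag + 2, $close - $endOfTag - 2)),
    ];
}

$result = extractTemplate("[template option=\"whatever\"]\n<p>any amount of html would go here</p>\n[/template]");
$missing = extractTemplate('no template tags here');

echo "option: {$result['option']}\n";
echo "html: {$result['html']}\n";
```

Searching for the full [/template] string rather than just [/ avoids stopping early on other bracketed text in the html.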