I have a plugin tag [crayon ...] that may or may not be rendered in a <p></p> block like so:

<p>This is a <b>sentence</b> [crayon ...] The Crayon [/crayon] of words. </p>

Since my tag is replaced by a <div> tag, the <p> is left disjoint from </p> and the browser closes it for me, leaving a blank paragraph above my plugin. In any case, the markup is invalid and has weird outcomes. My problem is that I need to detect whether [crayon lies inside a <p></p> block. I have found two ways so far:

  1. Use <p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?> and search for [crayon in the capture.
  2. Use <p[^>]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon for the case of <p>...[crayon where ... doesn't contain a </p> or <p> and a similar method for a </p> after the [crayon] tag.

The second method is harder to read, but it refuses to match if a </p> appears before my tag, and unlike the first it doesn't require any further processing to find my tag within the <p></p>. However, the first regex is much simpler and will execute more quickly. Which should I use, and is there a better way?
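For comparison, here is a minimal sketch of method 1 in Python rather than PHP. The helper name `paragraphs_with_crayon` and the sample string are made up for illustration, and `re.DOTALL` is an added assumption so that a paragraph may span several lines.

```python
import re

# Method 1: capture the body of each <p>...</p>, then look for the tag in it.
# re.DOTALL is an assumption added here so "." also matches newlines.
P_BLOCK = re.compile(r"<p(?:\s+[^>]*)?>(.*?)</p(?:\s+[^>]*)?>", re.DOTALL)

def paragraphs_with_crayon(html):
    """Return the inner text of every <p> block that contains '[crayon'."""
    return [m.group(1) for m in P_BLOCK.finditer(html) if "[crayon" in m.group(1)]

# Hypothetical sample input, not taken from a real page.
sample = (
    "<p>Intro.</p>"
    "<p>This is a <b>sentence</b> [crayon ...] The Crayon [/crayon] of words. </p>"
    "<pizza>yummy</pizza>"
)

for body in paragraphs_with_crayon(sample):
    print(repr(body))
```

The second step is a plain substring test, which is the "further processing" that method 2 avoids by folding everything into one pattern.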

EDIT:

For method 2, this beast works:

<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*((?:\[crayon[^\]]*\].*?\[/crayon\])|(?:\[crayon[^\]]*/\]))(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*</p[^<]*>
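One way to make a pattern like the one above easier to read is to build it from named pieces. The sketch below does that in Python. It is a restructured and slightly simplified variant, not the exact pattern: the names `OPEN_P`, `CLOSE_P`, `FILLER` and `CRAYON` are hypothetical, the opening and closing tags use the stricter `<p(?:\s+[^>]*)?>` form from method 1, the redundant `(\s+[^>]*)?` after `[^>]+` is dropped, and `re.DOTALL` is assumed so the Crayon body may contain newlines.

```python
import re

# <p> or <p attr="...">, but not <pizza>.
OPEN_P = r"<p(?:\s+[^>]*)?>"
CLOSE_P = r"</p(?:\s+[^>]*)?>"

# Any run of text and tags, as long as no tag is <p ...> or </p ...>.
FILLER = r"(?:[^<]*<(?!/?p(?:\s+[^>]*)?>)[^>]+>)*[^<]*"

# Either [crayon ...]...[/crayon] or the self-closing [crayon ... /].
CRAYON = r"(\[crayon[^\]]*\].*?\[/crayon\]|\[crayon[^\]]*/\])"

P_WITH_CRAYON = re.compile(OPEN_P + FILLER + CRAYON + FILLER + CLOSE_P, re.DOTALL)

# Hypothetical sample input.
html = "<p>a</p><p>This is a <b>sentence</b> [crayon lang=php]code[/crayon] of words.</p>"

m = P_WITH_CRAYON.search(html)
if m:
    print(m.start(), m.group(1))
```

Because `FILLER` cannot step over a `<p>` or `</p>`, the match has to start at the paragraph that directly encloses the tag, and group 1 holds the Crayon tag itself.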

Aram Kocharyan
  • I point you to [this SO discussion](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not). But regardless, if crayon can appear inside of a `<p>`, why are you using a `<div>` and not a `<span>`? If you want to correctly break the `<p>`, you'll need a proper HTML parser. – twooster Jan 21 '12 at 04:46
  • In WordPress my plugin must find and parse the `[crayon]` tag after WordPress has formatted the page, otherwise all formatting will be kept plain. If I render my tag before that formatting step, WordPress will then process my plugin's output as well, making a mess of things. Since my plugin must be a div, this causes the issue. By WordPress formatting, I mean it adding `<p>` with the wpautop() function. – Aram Kocharyan Jan 21 '12 at 04:59
  • that kind of regex is known as "write once, read never" – Justin Self Jan 21 '12 at 05:00
  • :) I'd like to avoid doing both if possible, but it works. I'm not convinced that's a good argument to choose in favor of keeping it though. – Aram Kocharyan Jan 21 '12 at 05:01
  • I don't know much about WordPress, but, quoting the [docs](http://codex.wordpress.org/Shortcode_API): "wpautop recognizes shortcode syntax and will attempt not to wrap p or br tags around shortcodes that stand alone on a line by themselves. Shortcodes intended for use in this manner should ensure that the output is wrapped in an appropriate block tag such as `<p>` or `<div>`." So maybe for your div-like shortcodes, ensure the tags are on their own lines? Otherwise, I'd create a `crayon_p` shortcode that will wrap in `<p>...</p>` rather than go the regex route. – twooster Jan 22 '12 at 22:57

1 Answer

Edited with an improved regex; notice I also stole your opening p tag detection ;). In PHP, I had to add the s modifier so that matches can span multiple lines:

/(?<!<!--)<p[^<]*>(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*[^<]*\[crayon.*?\].*?\[\/crayon\].*?<\/p>(?!(\s)?-->)/s

The following string was used for testing. 5 matches expected, 179 steps taken (the single regex from the question took 285 steps):

<p>This is a <b>sentence</b> [crayon]...[/crayon] of words.</p>
<p class="large"> Paragraph with parameters [crayon]...[/crayon]</p>
<p>[crayon with-parameters=true]...[/crayon]</p>
<p>
Multiline paragraph [crayon]...[/crayon].
Lorem ipsum.
</p>
<p>...</p><p>[crayon]...[/crayon]</p>
<!-- <p> --> This is a <b>sentence</b> [crayon]...[/crayon] of words.<!-- </p> -->
<pizza>yummy</pizza>
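As a rough way to check the match count, the sketch below runs the same pattern over the same test string in Python. It assumes Python's `re` module treats this pattern the same way PCRE does, with `re.DOTALL` standing in for the `s` modifier; the names `ANSWER_RE` and `TEST_INPUT` are made up, and the step counts quoted above are not reproduced here.

```python
import re

# The pattern from the answer, without the PHP delimiters; re.DOTALL
# plays the role of the trailing "s" modifier.
ANSWER_RE = re.compile(
    r"(?<!<!--)<p[^<]*>"
    r"(?:[^<]*<(?!/?p(\s+[^>]*)?>)[^>]+(\s+[^>]*)?>)*"
    r"[^<]*\[crayon.*?\].*?\[\/crayon\].*?<\/p>(?!(\s)?-->)",
    re.DOTALL,
)

TEST_INPUT = """<p>This is a <b>sentence</b> [crayon]...[/crayon] of words.</p>
<p class="large"> Paragraph with parameters [crayon]...[/crayon]</p>
<p>[crayon with-parameters=true]...[/crayon]</p>
<p>
Multiline paragraph [crayon]...[/crayon].
Lorem ipsum.
</p>
<p>...</p><p>[crayon]...[/crayon]</p>
<!-- <p> --> This is a <b>sentence</b> [crayon]...[/crayon] of words.<!-- </p> -->
<pizza>yummy</pizza>
"""

# The pattern contains capture groups, so collect whole matches explicitly.
matches = [m.group(0) for m in ANSWER_RE.finditer(TEST_INPUT)]
print(len(matches))
for text in matches:
    print(repr(text))
```

Printing each match next to the count makes it easy to see which paragraph was picked up and to compare against the 5 matches expected.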

Any improvement?

marcio
  • I thought of that, but it fails with this: `<p>..</p><p>[crayon]…[/crayon]</p>` by capturing the whole thing. The `.*?` following the opening `<p>` will match anything until it hits a Crayon tag, but I only need it to match the open `<p>` before it. There could in fact be many `<p>..</p>` from the start of the string. – Aram Kocharyan Jan 21 '12 at 08:36
  • Also, I tend to now use `<p(?:\s+[^>]*)?>` because `<p>` won't match tags with attributes and `<p[^>]*>` could match `<pizza>`. – Aram Kocharyan Jan 21 '12 at 08:38
  • Thanks. Also, what regex program do you use that gives step measurement? I'm on a Mac using Reggy and RegExhibit. – Aram Kocharyan Jan 24 '12 at 14:49
  • I use RegexBuddy, on Windows and on Linux (works OK with Wine 1.3.28). You just need to click the debug button and it shows the regex execution step by step. – marcio Jan 24 '12 at 15:02