
I'm having trouble handling HTML in text content. I'm thinking of a method that detects those tags and wraps each run of consecutive elements inside code tags.

Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>.

//expected result

Don't wrap me<code><p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span></code>Don't wrap me <code><h1>End</h1></code>.

Is this possible?

Lewis

3 Answers


It is hard to use DOMDocument in this specific case, since it automatically wraps text nodes in <p> tags (and adds doctype, head, html). One way is to build the pattern like a lexer, using the (?(DEFINE)...) feature and named subpatterns:

$html = <<<EOD
Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>
EOD;

$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<self>    < [^\W_]++ [^>]* > )
    (?<comment> <!-- (?>[^-]++|-(?!->))* -->)
    (?<cdata>   \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
    (?<text>    [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >
        (?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
        </ \g{-1} >
    )
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;

$html = preg_replace($pattern, '<code>$0</code>', $html);

echo htmlspecialchars($html);

The (?(DEFINE)...) feature lets you put a definition section inside a regex pattern. This definition section and the named subpatterns inside it don't match anything by themselves; they are only there to be used later in the main pattern.
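To see the definition section in isolation, here is a minimal sketch on a smaller problem. The subpattern names octet and ip and the sample string are hypothetical and only serve as an illustration; they are not part of the pattern above.

```php
<?php
// Hypothetical example: "octet" and "ip" are illustrative names only.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )
    (?<ip>    \g<octet> (?: \. \g<octet> ){3} )
)
# main pattern: the definitions above only match when called from here
\b \g<ip> \b
~x
EOD;

preg_match_all($pattern, 'hosts: 10.0.0.1, 192.168.1.254, 999.1.1.1', $m);
print_r($m[0]); // the two valid addresses; 999.1.1.1 is rejected
```

The DEFINE group itself consumes nothing, so the only thing that can match is what the main pattern calls with \g<ip>.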

(?<abcd> ...) defines a subpattern you can reuse later with \g<abcd>. In the above pattern, subpatterns defined in this way are:

  • self: describes a self-closing tag (as written, it matches any single opening tag)
  • comment: for html comments
  • cdata: for cdata
  • text: for text (all that is not a tag, a comment, or cdata)
  • tag: for html tags that are not self-closed

self:
[^\W_] is a trick to obtain \w without the underscore. [^\W_]++ represents the tag name and is also used in the tag subpattern.
[^>]* means all that is not a > zero or more times.
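A short sketch of both pieces, assuming made-up sample strings: the first three checks compare \w with [^\W_], and the last call runs the self subpattern on its own.

```php
<?php
// \w accepts the underscore; [^\W_] accepts the same characters minus "_".
$wordClass  = preg_match('~^\w+$~', 'my_tag');      // matches
$noUnderbar = preg_match('~^[^\W_]+$~', 'my_tag');  // fails because of "_"
$tagName    = preg_match('~^[^\W_]+$~', 'h1');      // matches
var_dump($wordClass, $noUnderbar, $tagName);

// The "self" subpattern alone: "<", a tag name, then anything up to ">".
preg_match_all('~< [^\W_]++ [^>]* >~x', 'a <br/> b <img src="x.png"> c </p>', $m);
print_r($m[0]); // <br/> and <img src="x.png">; </p> is skipped ("/" is not a name character)
```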

comment:
(?>[^-]++|-(?!->))* describes all the possible content inside an html comment:

(?>          # open an atomic group
    [^-]++   # all that is not a literal -, one or more times (possessive)
  |          # OR
    -        # a literal -
    (?!->)   # not followed by -> (negative lookahead)
)*           # close and repeat the group zero or more times 
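As a quick check of that group, the sketch below runs the comment subpattern by itself and compares it with a naive greedy version. The sample text is made up for the illustration.

```php
<?php
$text = 'a <!-- one - two -- three --> b <!-- four --> c';

// The comment subpattern from the answer, used as a standalone pattern.
$comment = '~<!-- (?>[^-]++|-(?!->))* -->~x';
preg_match_all($comment, $text, $m);
print_r($m[0]); // two separate comments

// A naive greedy pattern for comparison: it runs on to the last "-->".
preg_match('~<!--.*-->~', $text, $greedy);
echo $greedy[0], "\n";
```

Because a hyphen is only consumed when it is not followed by ->, the group can never run past the end of a comment, so each comment is matched separately.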

cdata:
All characters between \Q..\E are treated as literal characters, so special characters like [ don't need to be escaped. (This is only a trick to make the pattern more readable.)
The content allowed in CDATA is described in the same way as the content of HTML comments.
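The following sketch, on a made-up sample string, compares the \Q..\E form with a manually escaped equivalent. The variable names are only illustrative.

```php
<?php
$text = 'x <![CDATA[ a ] b ]] c ]]> y';

// With \Q..\E the opening sequence is written as-is.
$quoted  = '~\Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]>~x';
// The same pattern with the brackets escaped by hand.
$escaped = '~<!\[CDATA\[ (?>[^]]++|](?!]>))* ]]>~x';

preg_match($quoted, $text, $a);
preg_match($escaped, $text, $b);
var_dump($a[0], $b[0]); // both give the whole CDATA section
```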

text:
[^<]++ matches all characters up to the next opening angle bracket or the end of the string.

tag:
This is the most interesting subpattern. Lines 1 and 3 are the opening and the closing tag. Note that, in line 1, the tag name is captured with a capturing group. In line 3, \g{-1} refers to the content matched by the last defined capturing group ("-1" means "one on the left").
Line 2 describes the possible content between an opening and a closing tag. You can see that this description uses not only the subpatterns defined before but also the current subpattern itself, to allow nested tags.

Once all items have been set and the definition section closed, you can easily write the main pattern.
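To watch the recursion and the \g{-1} backreference work together, here is a reduced sketch that keeps only the text and tag subpatterns (self, comment and cdata are left out on purpose). The sample input is hypothetical.

```php
<?php
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<text> [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >          # line 1: opening tag, name captured
        (?> \g<text> | \g<tag> )*     # line 2: text or nested tags (recursion)
        </ \g{-1} >                   # line 3: closing tag with the same name
    )
)
\g<tag>
~x
EOD;

$input = 'x <div><p>a</p><div>b</div></div> y <p>unclosed <b>z</p>';
preg_match_all($pattern, $input, $m);
print_r($m[0]); // only the balanced <div>...</div> block
```

The nested <div> is consumed by the recursive call, so the outer closing tag is the one that ends the match. The second fragment is rejected because <b> is never closed, which is why the full pattern also needs the self alternative for stray opening tags.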

Casimir et Hippolyte
  • It works like magic, though I don't get a word on it. Big thanks for your answer. – Lewis Apr 03 '14 at 15:57
  • Thank you so much for your explanations. This could be the best method at this time, I think. – Lewis Apr 03 '14 at 16:26
  • I got an error `Compilation failed: assertion expected after (?( at offset 6` when I try to use it. Do you have any idea about this? – Lewis Apr 04 '14 at 17:53
  • If anybody is coming across this problem, take a look at this http://stackoverflow.com/questions/22870199/regex-compilation-failed-assertion-expected-at-offset-6 – Lewis Apr 04 '14 at 19:29

I'm in a trouble with treating HTML in text content.

Then just escape that text:

echo htmlspecialchars($your_text_that_may_contain_html_code);
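For illustration, this is what escaping does to one of the fragments from the question. The flags and charset are passed explicitly here as an assumption, so the result does not depend on the defaults of a particular PHP version.

```php
<?php
$text = '<div class="text">wrap me please!</div>';

// Explicit flags and charset (an assumption for this sketch).
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";
// &lt;div class=&quot;text&quot;&gt;wrap me please!&lt;/div&gt;
```

The browser then shows the markup as text instead of rendering it, but nothing gets wrapped in code tags, so this only covers part of what the question asks for.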

Parsing HTML with regex is a well-known big NO!

Sharky
  • I got slapped for linking to that answer yesterday - no offense, just [passing the slap on to you](http://meta.stackexchange.com/questions/182189/) ;-) – freefaller Apr 03 '14 at 14:46
  • @freefaller not my fault that answer still exists, so i have every right to link there. no offense, no slaps taken :P – Sharky Apr 03 '14 at 14:50

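This will find tags along with their closing tags, and everything in between. The tag name has to be captured in a group so that \1 can refer to it, and the pattern needs the case-insensitive modifier because the character classes only list uppercase letters:

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>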

You might be able to capture those elements and replace each match with itself wrapped in code tags. It will not work in every case, but you might find it sufficient for your needs if the HTML is fairly static.
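A minimal sketch of that idea with preg_replace, assuming the tag name is captured in group 1 so that \1 has something to refer to. The outer (?: ... )+ is my addition so that consecutive elements end up in a single code element; the i and s modifiers make the match case-insensitive and let the dot cross newlines.

```php
<?php
// Assumption: tag name captured in group 1; the outer (?:...)+ groups consecutive elements.
$pattern = '~(?:<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>)+~is';

$html = 'Don\'t wrap me<p>Hello</p><div class="text">wrap me please!</div> Don\'t wrap me <h1>End</h1>.';
$result = preg_replace($pattern, '<code>$0</code>', $html);

echo htmlspecialchars($result), "\n";
```

This stops at the first matching closing tag, so an element nested inside another element of the same name (a div inside a div, for example) is cut short. The recursive pattern in the accepted answer handles that case.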