You can use this:
$pattern = <<<'LOD'
~
# definitions :
(?(DEFINE) (?<tagBL> pre | code | textarea | style | script )
(?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
(?<tags> < [^>]* > )
(?<cdata> <!\[CDATA .*? ]]> )
(?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
)
# pattern :
\g<exclusionList> (*SKIP) (*FAIL) | \s+
~xsi
LOD;
$html = preg_replace($pattern, ' ', $html);
Note that this is a general approach; you can easily adapt it to a specific case by adding items to or removing items from the exclusion list.
If you need another type of replacement, you can adapt it too by using capturing groups and preg_replace_callback()
.
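As a rough sketch of the preg_replace_callback() idea: instead of skipping the excluded parts, the variant below captures them and hands every match to a callback. The group names keep and ws, the sample string and the newline-preserving rule are my own assumptions for illustration, and the exclusion list is shortened to keep the example small.

```php
<?php
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<tagBL> pre | code | textarea | style | script )
    (?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
    (?<tags> < [^>]* > )
)
  (?<keep> \g<tagContent> | \g<tags> )   # parts returned untouched
| (?<ws> \s+ )                           # whitespace runs to rewrite
~xsi
LOD;

$html = "<p>a   b</p>  \n  <pre>x   y</pre>";

$result = preg_replace_callback($pattern, function (array $m) {
    // Excluded content (tags, protected blocks) goes back unchanged.
    if (($m['keep'] ?? '') !== '') {
        return $m['keep'];
    }
    // A whitespace run becomes one newline if it contained one, else one space.
    return strpos($m['ws'], "\n") !== false ? "\n" : ' ';
}, $html);

var_dump($result); // string(29) "<p>a b</p>\n<pre>x   y</pre>" (with a real newline)
```

The callback can apply any logic you like to the ws branch; the keep branch is what replaces (*SKIP)(*FAIL) here.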
Another remark: an HTML tag stays open until its closing tag. If the closing tag doesn't exist, all the content after the tag belongs to it until the end of the string. To deal with that, you can change </ \g{-1} >
to (?: </ (?:\g{-1}| head | body | html) > | $)
in the tag content definition, for example, or compose more advanced rules.
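Here is a minimal sketch of that change applied to the full pattern. The sample string (a pre tag that is never closed) is an assumption of mine, chosen to show that the whitespace after the unclosed tag is left alone.

```php
<?php
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<tagBL> pre | code | textarea | style | script )
    (?<tagContent> < (\g<tagBL>) \b .*? (?: </ (?: \g{-1} | head | body | html ) > | $ ) )
    (?<tags> < [^>]* > )
    (?<cdata> <!\[CDATA .*? ]]> )
    (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags> )
)
\g<exclusionList> (*SKIP) (*FAIL) | \s+
~xsi
LOD;

// The <pre> tag is never closed: everything after it is treated as its content.
$html = "<p>a   b</p>   <pre>x   y";
$result = preg_replace($pattern, ' ', $html);

var_dump($result); // string(22) "<p>a b</p> <pre>x   y"
```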
EDIT:
Some information you can find in the PHP manual:
The nowdoc syntax is an alternative syntax to define strings.
It can be very useful for making a multiline string more readable without modifying its layout, and it avoids any question about whether quotes must be escaped.
The nowdoc syntax behaves like single quotes, i.e. variables are not interpolated and escape sequences like \t
or \n
are not interpreted. If you want the same behaviour as double quotes, use the heredoc syntax.
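A small comparison may help. The variable $name and the TXT identifier are arbitrary names chosen for this sketch.

```php
<?php
$name = 'world';

// nowdoc: the identifier is in single quotes, nothing is interpreted.
$nowdoc = <<<'TXT'
Hello $name\n
TXT;

// heredoc: behaves like a double-quoted string.
$heredoc = <<<TXT
Hello $name\n
TXT;

var_dump($nowdoc === 'Hello $name\n');  // bool(true): literal $name and literal \n
var_dump($heredoc === "Hello world\n"); // bool(true): variable and escape interpreted
```

This is why nowdoc is convenient for regex patterns: backslashes and dollar signs reach the regex engine exactly as written.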
Some information you can find in http://pcre.org/pcre.txt:
First of all: the pattern delimiter
Most of the time, people write their patterns with the /
delimiter. /Gnagnagna/
, /blablabla/ixUums
, etc.
But when they write a pattern containing a thousand or a million slash characters, they would rather escape each of the slashes, one by one, than choose another delimiter! With PHP, you can choose any pattern delimiter you want, as long as it is not an alphanumeric character, a backslash, or whitespace. I have chosen ~
instead of /
for three reasons:
- If I choose
~
, I don't have to escape slashes, because there is no ambiguity between the delimiter and a literal character.
- In eight months on this site, I have never seen anybody ask for a pattern with a tilde inside.
- I'm sure that if one day someone asks for a pattern with a tilde, it will mean I have had an encounter of the third kind.
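To make the difference concrete, here is the same pattern written with both delimiters. The URL and the variable names are made up for the example.

```php
<?php
$url = 'http://example.com/a/b';

// With / as delimiter, every literal slash must be escaped.
$withSlash = '/^https?:\/\/[^\/]+\/a\/b$/';

// With ~ as delimiter, slashes are ordinary characters.
$withTilde = '~^https?://[^/]+/a/b$~';

$r1 = preg_match($withSlash, $url);
$r2 = preg_match($withTilde, $url);

var_dump($r1, $r2); // int(1) int(1)
```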
Second: How to make a long pattern more readable?
PCRE (Perl Compatible Regular Expressions, the regex engine used by PHP) offers ways to make a pattern more readable. They are exactly the same ones you find in ordinary code:
- You can ignore white spaces
- You can add comments
- You can define subpatterns
For the first two, it's easy: you only need to add the x modifier (this is why you find an x at the end of the pattern). The x modifier enables verbose mode, where whitespace is ignored and where you can add comments like this # comment
at the end of a line.
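As an illustration, here is one small pattern written twice, compact and verbose. The date format and the variable names are assumptions for the sketch.

```php
<?php
$compact = '~^(\d{4})-(\d{2})-(\d{2})$~';

// Same pattern in verbose mode: whitespace is ignored, # starts a comment.
$verbose = <<<'LOD'
~
^
(\d{4})   # year
-
(\d{2})   # month
-
(\d{2})   # day
$
~x
LOD;

$r1 = preg_match($compact, '2024-01-31', $m1);
$r2 = preg_match($verbose, '2024-01-31', $m2);

var_dump($r1, $r2);   // int(1) int(1)
var_dump($m1 === $m2); // bool(true)
```

One thing to keep in mind in verbose mode: a literal space in the subject must be written as \s, [ ] or an escaped space, since plain spaces in the pattern are ignored.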
About subpatterns: you can use named groups. For example, instead of writing ~([0-9]+)~
to match and capture a number inside group 1, you can write ~(?<number>[0-9]+)~
. Now, with this named subpattern, you can refer to the captured content with \g{number}
or to the pattern itself with \g<number>
, anywhere in the pattern. Examples:
~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~
will match 45ab67cd
~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~
will match 45ab45cd
but not 45ab67cd
In these two examples, the named subpatterns are part of the main pattern and take part in the match from the start of the string. But using the (?(DEFINE)...)
syntax, you can define them outside the main pattern, because nothing you write between these parentheses is matched.
~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~
doesn't match 45ab67cd
, because everything inside the DEFINE
part is ignored for the match, but:
~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~
does.
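The four patterns above can be checked quickly. This is only a verification sketch; the variable names are mine.

```php
<?php
$subroutine  = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~';
$backref     = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~';
$defineShort = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~';
$defineFull  = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~';

// \g<num> reuses the subpattern, so any digits are accepted the second time.
$a = preg_match($subroutine, '45ab67cd');   // 1

// \g{num} reuses the captured text, so the same digits are required.
$b = preg_match($backref, '45ab45cd');      // 1
$c = preg_match($backref, '45ab67cd');      // 0

// The DEFINE part matches nothing by itself.
$d = preg_match($defineShort, '45ab67cd');  // 0
$e = preg_match($defineShort, '45ab');      // 1
$f = preg_match($defineFull, '45ab67cd');   // 1

var_dump($a, $b, $c, $d, $e, $f);
```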
Third: relative backreferences
When you use a capturing group in a pattern, you can refer to the captured content. Example:
$str = 'cats meow because cats are bad.';
$pattern = '~^(\w+) \w+ \w+ \1 \w+ \w+\.$~';
var_dump(preg_match($pattern, $str));
This code displays int(1)
since the pattern matches the string (preg_match() returns 1 on a match, not true). In the pattern, \1
refers to the content (cats
) of the first capturing group. Instead of writing \1
, you can use the alternative syntax and write \g{1}
, which refers to the first capturing group too; it is the same.
Now, if you want to refer to the content of the last capturing group, but you don't care about the number (or the name) of the group, you can use a relative reference by writing \g{-1}
(i.e. the first group to the left of the reference).
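A short sketch comparing the three notations. The second subject string and the "prefixed" patterns are assumptions of mine, added to show why a relative reference is handy when a group is inserted earlier in the pattern.

```php
<?php
$str = 'cats meow because cats are bad.';

$numbered = preg_match('~^(\w+) \w+ \w+ \1 \w+ \w+\.$~', $str);      // 1
$braced   = preg_match('~^(\w+) \w+ \w+ \g{1} \w+ \w+\.$~', $str);   // 1
$relative = preg_match('~^(\w+) \w+ \w+ \g{-1} \w+ \w+\.$~', $str);  // 1

// Add a new capturing group in front: \1 now points at the wrong group,
// while \g{-1} still points at the group just before it.
$str2 = 'note: cats meow because cats are bad.';
$absoluteShifted = preg_match('~^(\w+): (\w+) \w+ \w+ \1 \w+ \w+\.$~', $str2);     // 0
$relativeShifted = preg_match('~^(\w+): (\w+) \w+ \w+ \g{-1} \w+ \w+\.$~', $str2); // 1

var_dump($numbered, $braced, $relative, $absoluteShifted, $relativeShifted);
```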
Fourth: the modifiers xsi
The general behaviour of a pattern can be changed by modifiers. Here I used three modifiers:
x # for verbose mode
i # make the pattern case insensitive (i.e. '~CaT~i' will match "cat")
s # (singleline mode): by default the . doesn't match newline, with the s modifier it does.
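Each modifier can be seen in isolation with a tiny test. The subject strings are arbitrary.

```php
<?php
// i: case-insensitive matching
$withI    = preg_match('~CaT~i', 'cat');   // 1
$withoutI = preg_match('~CaT~', 'cat');    // 0

// s: the dot also matches a newline
$withS    = preg_match('~a.b~s', "a\nb");  // 1
$withoutS = preg_match('~a.b~', "a\nb");   // 0

// x: whitespace in the pattern is ignored
$withX    = preg_match('~ c a t ~x', 'cat'); // 1
$withoutX = preg_match('~ c a t ~', 'cat');  // 0

var_dump($withI, $withoutI, $withS, $withoutS, $withX, $withoutX);
```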
Last: backtracking control verbs
Backtracking control verbs are an experimental feature inherited from the Perl regex engine (the feature is flagged as experimental in Perl too, but that status will not change if nobody uses it).
What is backtracking?
If I try to match "aaaaab"
with ~a+ab~
, the regex engine, since +
is a greedy quantifier, will first consume all the a
characters (five of them), but after that only a b
remains, which does not match the subpattern ab
. The only way forward for the regex engine is to give back one a
, and then it is possible to match ab
. This is the default behaviour of the regex engine.
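You can observe this with a capturing group around a+. For contrast, the sketch also uses a possessive quantifier (a++), which is not part of the answer's pattern; I bring it in only because it forbids the engine from giving characters back.

```php
<?php
// Greedy a+ first takes five a's, then gives one back so that "ab" can match.
$greedy = preg_match('~(a+)ab~', 'aaaaab', $m);
var_dump($greedy); // int(1)
var_dump($m[1]);   // string(4) "aaaa"

// Possessive a++ never gives anything back, so "ab" can never match.
$possessive = preg_match('~a++ab~', 'aaaaab');
var_dump($possessive); // int(0)
```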
More about backtracking here and here.
The backtracking control verbs are tools that force the regex engine to behave the way you want for a subpattern.
Here I used two verbs: (*SKIP)
and (*FAIL)
(*FAIL)
is the easiest. The subpattern is forced to fail immediately.
(*SKIP)
: when the subpattern fails after this verb, the regex engine is not allowed to backtrack over the characters matched before the verb. The next match attempt starts at the position reached when the verb was encountered, so this content can't be reused by another alternative.
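A reduced version of the same trick may make the role of (*SKIP) clearer. Here double-quoted parts play the role of the exclusion list; the subject string and the underscore replacement are assumptions for the sketch.

```php
<?php
$subject = 'a b "c d" e';

// Quoted parts are matched, then skipped and rejected: only outside spaces are replaced.
$withSkip = preg_replace('~"[^"]*"(*SKIP)(*FAIL)|\s+~', '_', $subject);
var_dump($withSkip); // string(11) "a_b_"c d"_e"

// Without (*SKIP), the engine retries inside the quoted part and replaces there too.
$withoutSkip = preg_replace('~"[^"]*"(*FAIL)|\s+~', '_', $subject);
var_dump($withoutSkip); // string(11) "a_b_"c_d"_e"
```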
I understand that all these things are not always easy, but I hope that, step by step, one day, all of these things will be clear for you.