regexp: match all but every <(pre|code|textarea)>(.*?) in an html document

Question

It's a challenge!

As the title says, I would like to match everything except the content of the <pre>, <code> and <textarea> tags in an HTML document (you can try it on the sample text below).

The purpose in my case is to compress HTML by removing \n, \t, \r and doing other cleanup, except where whitespace is strictly required, as in a textarea.

As I work in PHP, I also thought about extracting the content of those tags, processing the rest in PHP and then reinjecting the content. But I'm very curious whether there is a way to do that with a regexp!

I tried the expression ((?=.?)((?!<pre>).)) with the flags 'msg' in the great online editor http://regex101.com/, but it is not exactly what I want.

Any help would be much appreciated!

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna <span>aliquam</span> erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.

<pre>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum.
Typi non habent claritatem insitam; est usus legentis in iis qui facit eorum claritatem.</pre>

Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius.
Claritas est etiam processus dynamicus, qui sequitur mutationem consuetudium lectorum.
<pre>Mirum est notare quam littera gothica, quam nunc putamus parum claram, anteposuerit litterarum formas humanitatis per seacula quarta decima et quinta decima.</pre>
Eodem modo typi, qui nunc nobis videntur parum clari, fiant sollemnes in futurum.

This is not a good fit for regular expressions. You need to be sure that the HTML you are parsing conforms strictly to the assumptions of the expression (e.g. Steve's answer). If your forum allows users to format posts with simple HTML, you probably can't ensure that. What if someone uses nested `<pre>` tags or forgets to close one? Consider using an existing HTML parser. – Jens, Dec 06 '13 at 11:30
Jens, I thought I knew the HTML content of my site perfectly until you reminded me of the forum part... You are totally right. :( That aside, if I want to work on a specific part of the site that I can certify is well coded, my question remains, at least to know how to do it with a regexp. Thank you for helping. – antoni, Dec 06 '13 at 18:38
Jens, actually, that can still work if I exclude my forum content div with the Casimir et Hippolyte code. – antoni, Dec 06 '13 at 19:17

Casimir et Hippolyte · Accepted Answer · 2014-06-18T02:11:45.193

You can use this:

$pattern = <<<'LOD'
~
# definitions : 
(?(DEFINE) (?<tagBL> pre | code | textarea | style | script )
     (?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
     (?<tags> < [^>]* > )
     (?<cdata> <!\[CDATA .*? ]]> )

     (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
)

# pattern :
\g<exclusionList> (*SKIP) (*FAIL) | \s+
~xsi
LOD;

$html = preg_replace($pattern, ' ', $html);

Note that this is a general approach: you can easily adapt it to a specific case by adding items to or removing items from the exclusion list. If you need other types of replacement, you can adapt it too by using capturing groups and preg_replace_callback().
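To see what the pattern does, here is a minimal sketch that runs it on a small made-up document. The sample HTML and the variable $min are assumptions for illustration only; the pattern itself is the one given above, without the comments.

```php
<?php
// Pattern from the answer above (comments removed).
$pattern = <<<'LOD'
~
(?(DEFINE) (?<tagBL> pre | code | textarea | style | script )
     (?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
     (?<tags> < [^>]* > )
     (?<cdata> <!\[CDATA .*? ]]> )
     (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
)
\g<exclusionList> (*SKIP) (*FAIL) | \s+
~xsi
LOD;

// Made-up sample: whitespace inside <pre> must survive, the rest is collapsed.
$html = "<p>one\n\t two</p>\n<pre>keep\n\tthis</pre>\n<p>three</p>";
$min  = preg_replace($pattern, ' ', $html);

var_dump($min === "<p>one two</p> <pre>keep\n\tthis</pre> <p>three</p>");
```

Every run of whitespace outside the protected tags becomes a single space, while the newline and tab inside <pre> are left alone.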

Another remark: an HTML tag stays open until its closing tag. If the closing tag doesn't exist, all the content after the opening tag belongs to it, up to the end of the string. To deal with that, you can for example change </ \g{-1} > to (?: </ (?:\g{-1}| head | body | html) > | $) in the tag content definition, or compose more advanced rules.
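A sketch of that variant, assuming a made-up input whose <pre> is never closed. Only the tagContent definition differs from the pattern above; the sample string and $min are illustrative names.

```php
<?php
// Variant: a protected tag also ends at </head>, </body>, </html>
// or at the end of the string when its own closing tag is missing.
$pattern = <<<'LOD'
~
(?(DEFINE) (?<tagBL> pre | code | textarea | style | script )
     (?<tagContent> < (\g<tagBL>) \b .*?
                    (?: </ (?:\g{-1} | head | body | html) > | $ ) )
     (?<tags> < [^>]* > )
     (?<cdata> <!\[CDATA .*? ]]> )
     (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
)
\g<exclusionList> (*SKIP) (*FAIL) | \s+
~xsi
LOD;

// Made-up sample with an unclosed <pre>.
$html = "<p>a \n b</p>\n<pre>x \n y";
$min  = preg_replace($pattern, ' ', $html);

var_dump($min === "<p>a b</p> <pre>x \n y");
```

With the original definition, the unclosed <pre> would only be protected as a bare tag, so the whitespace after it would be collapsed as well.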

EDIT:

Some information you can find in the PHP manual:

The nowdoc syntax is an alternative syntax for defining strings.
It is very useful for keeping a multiline string readable: you don't have to change its layout or wonder whether quotes need escaping.
The nowdoc syntax behaves like single quotes, i.e. variables are not interpolated and escape sequences like \t or \n are not interpreted. If you want the same behaviour as double quotes, use the heredoc syntax.

Some information you can find in http://pcre.org/pcre.txt:
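A small sketch of the difference, with a made-up variable $name and the identifier TXT chosen arbitrarily:

```php
<?php
$name = 'world';

// Nowdoc (quoted identifier): like single quotes, nothing is interpreted.
$nowdoc = <<<'TXT'
Hello $name\t!
TXT;

// Heredoc (bare identifier): like double quotes, $name and \t are interpreted.
$heredoc = <<<TXT
Hello $name\t!
TXT;

var_dump($nowdoc === 'Hello $name\t!');   // literal dollar sign and backslash
var_dump($heredoc === "Hello world\t!");  // variable replaced, real tab character
```

This is why nowdoc is convenient for patterns: backslashes such as \g or \s reach the regex engine untouched.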

First of all: the pattern delimiter

Most of the time, people write their patterns with the / delimiter: /Gnagnagna/, /blablabla/ixUums, etc.
But when they write a pattern containing a thousand or a million slash characters, they would rather escape each of the thousand slashes, one by one, than choose another delimiter! With PHP, you can choose any pattern delimiter you want, as long as it is not an alphanumeric character, a backslash or whitespace. I have chosen ~ instead of / for three reasons:

1. If I choose ~, I don't have to escape slashes, because there is no ambiguity between the delimiter and a literal character.
2. In eight months on this site, I have never seen anybody ask for a pattern with a tilde inside.
3. I'm sure that if one day someone asks for a pattern with a tilde, it will mean I have had an encounter of the third kind.
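To illustrate the first reason, here is a short sketch with the same pattern written with both delimiters. The URL path is made up for the example.

```php
<?php
$url = 'http://regex101.com/r/example';  // hypothetical URL

// With / as delimiter, every literal slash has to be escaped.
$withSlash = '/^http:\/\/[^\/]+\//';
// With ~ as delimiter, slashes are ordinary characters.
$withTilde = '~^http://[^/]+/~';

preg_match($withSlash, $url, $m1);
preg_match($withTilde, $url, $m2);

var_dump($m1[0] === $m2[0]);  // both capture "http://regex101.com/"
```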

Second: How to make a long pattern more readable?

PCRE (Perl Compatible Regular Expressions, the regex engine used by PHP) has ways to make a pattern more readable. They are exactly the same ones you find in ordinary code:

1. You can ignore white space
2. You can add comments
3. You can define subpatterns

For 1 and 2, it's easy: you only need to add the x modifier (that is why you find an x at the end). The x modifier enables verbose mode, where white space outside character classes is ignored and where you can add comments like this # comment at the end of a line.

About subpatterns: you can use named groups. For example, instead of writing ~([0-9]+)~ to match and capture a number in group 1, you can write ~(?<number>[0-9]+)~. With this named subpattern, you can refer to the captured content with \g{number} or to the pattern itself with \g<number>, anywhere in the pattern. Examples:

~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~

will match 45ab67cd

~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~

will match 45ab45cd but not 45ab67cd

In these two examples, the named subpatterns are part of the main pattern and match the start of the string. But with the (?(DEFINE)...) syntax, you can define them outside the main pattern, because nothing you write between these parentheses is matched directly.

~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~

doesn't match 45ab67cd, because all inside the DEFINE part is ignored for the match, but:

~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~

does.
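The four patterns above can be checked with preg_match(), which returns 1 on a match and 0 otherwise. The variable names below are only there for the sketch.

```php
<?php
$subject = '45ab67cd';

// \g<num> and \g<letter> re-run the subpatterns, so new digits/letters are fine.
$subroutine = preg_match('~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~', $subject);

// \g{num} is a backreference: the same digits must appear again.
$backrefOther = preg_match('~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~', $subject);
$backrefSame  = preg_match('~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~', '45ab45cd');

// Groups inside (?(DEFINE)...) match nothing by themselves.
$defineOnce  = preg_match('~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~', $subject);
$defineTwice = preg_match('~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~', $subject);

var_dump($subroutine, $backrefOther, $backrefSame, $defineOnce, $defineTwice);
// int(1) int(0) int(1) int(0) int(1)
```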

Third: relative backreferences

When you use a capturing group in a pattern, you can refer to the captured content. Example:

$str = 'cats meow because cats are bad.';

$pattern = '~^(\w+) \w+ \w+ \1 \w+ \w+\.$~';

var_dump(preg_match($pattern, $str));

This code prints int(1), since preg_match() returns 1 when the pattern matches the string. In the pattern, \1 refers to the content (cats) of the first capturing group. Instead of writing \1, you can use the \g{1} syntax (introduced in Perl 5.10), which refers to the first capturing group too; it is the same thing.

Now, if you want to refer to the content of the last capturing group, but you don't care about the number (or the name) of the group, you can use a relative reference by writing \g{-1} (i.e. the first group on my left, the most recently opened capturing group before the reference).
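A short sketch comparing the three notations on the sentence above, plus a two-group example (the strings 'abb' and 'aba' are made up) showing that \g{-1} points at the nearest group on its left:

```php
<?php
$str = 'cats meow because cats are bad.';

$numbered = preg_match('~^(\w+) \w+ \w+ \1 \w+ \w+\.$~', $str);
$braces   = preg_match('~^(\w+) \w+ \w+ \g{1} \w+ \w+\.$~', $str);
$relative = preg_match('~^(\w+) \w+ \w+ \g{-1} \w+ \w+\.$~', $str);

// With two groups, \g{-1} refers to (b), the closest group on the left.
$nearestOk   = preg_match('~^(a)(b)\g{-1}$~', 'abb');
$nearestFail = preg_match('~^(a)(b)\g{-1}$~', 'aba');

var_dump($numbered, $braces, $relative, $nearestOk, $nearestFail);
// int(1) int(1) int(1) int(1) int(0)
```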

Fourth: the modifiers xsi

The general behaviour of a pattern can be changed by modifiers. Here I used three modifiers:

x # for verbose mode
i # make the pattern case insensitive (i.e. '~CaT~i' will match "cat")
s # (singleline mode): by default the . doesn't match newline, with the s modifier it does.
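Each modifier can be tried in isolation. The test strings below are made up for this sketch:

```php
<?php
// i: case-insensitive matching.
$withI    = preg_match('~CaT~i', 'cat');
$withoutI = preg_match('~CaT~', 'cat');

// s: the dot also matches a newline.
$withS    = preg_match('~a.b~s', "a\nb");
$withoutS = preg_match('~a.b~', "a\nb");

// x: white space in the pattern is ignored and # starts a comment.
$withX = preg_match('~ c a t   # spaces and this comment are ignored
~x', 'cat');

var_dump($withI, $withoutI, $withS, $withoutS, $withX);
// int(1) int(0) int(1) int(0) int(1)
```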

The last: Backtracking control verbs

Backtracking control verbs are an experimental feature inherited from the Perl regex engine (the status is experimental in Perl too, but if nobody uses it, it will not change).

What is backtracking?

If I try to match "aaaaab" with ~a+ab~, the regex engine, since + is a greedy quantifier, will first take all the a's (five of them). After that only a b remains, which does not match the subpattern ab. The only way out for the regex engine is to give back one a, and then it is possible to match ab. This is the default behaviour of the regex engine.

More about backtracking here and here.
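One way to observe backtracking is to forbid it. The sketch below compares the greedy pattern with a possessive quantifier (a++), which is not used in the answer and is only brought in here as an illustration: it never gives characters back.

```php
<?php
// Greedy: a+ takes five a's, then gives one back so that "ab" can match.
$greedy = preg_match('~a+ab~', 'aaaaab', $m);

// Possessive: a++ keeps every a it took, so "ab" can never match.
$possessive = preg_match('~a++ab~', 'aaaaab');

var_dump($greedy, $m[0], $possessive);
// int(1) string(6) "aaaaab" int(0)
```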

The backtracking control verbs are tools that force the regex engine to behave the way you want for a subpattern.

Here I used two verbs: (*SKIP) and (*FAIL)

(*FAIL) is the easiest. The subpattern is forced to fail immediately.

(*SKIP): when the subpattern fails after this verb, the regex engine is not allowed to backtrack into the characters matched before the verb. In addition, that content can't be reused by another alternative: the next match attempt starts at the position where (*SKIP) was reached.
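A reduced sketch of the same idea, assuming a made-up task: replace numbers with #, except inside double quotes. Removing (*SKIP) shows what the verb contributes.

```php
<?php
$text = 'a 12 "b 34" c 56';  // made-up sample

// Quoted strings are matched, then thrown away together with their content.
$withSkip = preg_replace('~"[^"]*"(*SKIP)(*FAIL)|\d+~', '#', $text);

// Without (*SKIP), the engine retries inside the quotes and finds 34 there.
$withoutSkip = preg_replace('~"[^"]*"(*FAIL)|\d+~', '#', $text);

var_dump($withSkip);     // string(16) "a # "b 34" c #"
var_dump($withoutSkip);  // string(15) "a # "b #" c #"
```

In the HTML pattern, the quoted string is replaced by the exclusion list and \d+ by \s+, but the mechanism is the same.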

I understand that all these things are not always easy, but I hope that, step by step, one day, all of these things will be clear for you.

First, thank you. You're like awesome. I don't know this syntax but it works. Now, why can't it be simpler, in regexp-like format? Can you explain a little more to be sure I don't miss anything please? Thanks a lot though. — antoni, Dec 06 '13 at 18:58
Can you explain, I mean: what is 'LOD', '~', 'tagBL', '\g{-1}', '(*SKIP) (*FAIL)', '~xsi', why is there cdata and are the labels 'exclusionList', 'tagContent', 'cdata', 'tags' predefined? Thanks again. — antoni, Dec 06 '13 at 19:07
IT IS ALL CLEAR! except for the (*SKIP) and (*FAIL) thing I have to read more on. Wow I should print your answer and meditate on it every day! It's amazing I learnt a lot thank you again. — antoni, Dec 07 '13 at 15:35

score 1 · Answer 2 · edited May 23 '17 at 10:32

If you want to parse HTML, I would suggest using PHP's DOMXPath or similar, as it is meant and specialised for that task. You'll find Chrome extensions to test your queries.

Also read this answer, it's funny: "You can't parse [X]HTML with regex. Because HTML can't be parsed by regex". It was voted up more than 4400 times.

edit: That said, maybe you need to parse only fragments or invalid HTML; in that case I would go with a "simple" regex approach like the one Steve P. gives in his answer.
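A rough sketch of what the DOMXPath route could look like for the whitespace cleanup in the question. It assumes the dom extension is available and well-formed input; the sample HTML, the XPath query and the variable names are illustrative choices, not something taken from the question.

```php
<?php
// Made-up sample with a single root element.
$html = "<div><p>one\n\t two</p><pre>keep\n\tthis</pre></div>";

$doc = new DOMDocument();
// The flags avoid adding <html>/<body> wrappers and a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// Every text node that is not inside pre, code or textarea.
$query = '//text()[not(ancestor::pre or ancestor::code or ancestor::textarea)]';

foreach ($xpath->query($query) as $node) {
    // Collapse whitespace runs in the unprotected text only.
    $node->nodeValue = preg_replace('~\s+~', ' ', $node->nodeValue);
}

$out = $doc->saveHTML();
```

The parser decides what belongs to which tag, so nested or unclosed tags are handled by its recovery rules rather than by the pattern.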

answered Dec 06 '13 at 10:49

alfonsodev


You can't parse HTML with regex, but if you're dealing with a restricted case, it's possible. Does my answer not work for the restrictions given? – Steve P. Dec 06 '13 at 10:51
ye I was adding that clarification when you posted the comment :) I'm agree. – alfonsodev Dec 06 '13 at 10:53
thank you for this idea. I don't need an extra library in this case it should be very light and done in a blink. in a few PHP lines. ;) – antoni Dec 06 '13 at 18:25

score 0 · Answer 3 · edited May 23 '17 at 10:27

Assuming you want to capture what's in between the tags:

regex = "<((?!(?:pre|code|textarea)>)[a-z][a-z0-9]*)>([^<]+)</\1>"

(?!...) is a negative look-ahead
([^<]+) group and capture 1 or more characters that are not <
\1 refers to the original capturing group (tag)

This is based on the assumption that < is not a valid character in between tags, implying that tags are not nested. If said restrictions are not true, you will not be able to parse HTML with regex, see the obligatory post that everyone references, for good reason.
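A sketch of this look-ahead idea in PHP, on a made-up fragment. It assumes the tag name is matched explicitly after the look-ahead with [a-z][a-z0-9]*; that part and the ~ delimiter are choices made for this example.

```php
<?php
// Capture the content of simple, non-nested tags other than pre/code/textarea.
$pattern = '~<((?!(?:pre|code|textarea)>)[a-z][a-z0-9]*)>([^<]+)</\1>~i';

$html = '<p>keep me</p><pre>skip me</pre><span>and me</span>';  // made-up sample
preg_match_all($pattern, $html, $m);

var_dump($m[1]);  // tag names: p, span
var_dump($m[2]);  // contents: "keep me", "and me"
```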

answered Dec 06 '13 at 10:43

Steve P.


Thank you for your idea. I do know how to capture the inner of that tags. my question is how to capture all but that! ;) More delicate. – antoni Dec 06 '13 at 18:22
