What does this regex in PHP mean?

Question

Can somebody please explain to me what this regex means?

#<hr(.*)class="system-pagebreak"(.*)\/>#iU

Is there a tool to convert these regular expressions to normal words?

This tool very clearly explains each symbol (token): http://regex101.com/. You may have to separate the flags and delimiters first. — gskema, Feb 07 '14 at 12:53

user229044 · Answer 1 · 2014-02-07T12:54:56.137

It is attempting* to match any <hr> tags that have class="system-pagebreak" attributes.

The (.*) segments, one between hr and class and one before the closing />, each match "zero or more characters", so it can match things like

<hr id="what" class="system-pagebreak" style="display:block" />

The iU after the closing # delimiter makes the pattern case-insensitive (i) and ungreedy (U), so that each .* matches as little as possible instead of running on to the last /> on the line.
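As a minimal sketch of what this looks like in practice, preg_match can show what the whole pattern and each (.*) group end up holding. The surrounding paragraph tags in $html are an assumption added for illustration; only the hr tag comes from the example above.

```php
<?php
// The pattern from the question: # is the delimiter, i and U are flags.
$pattern = '#<hr(.*)class="system-pagebreak"(.*)\/>#iU';

// Hypothetical input, built around the example tag above.
$html = '<p>one</p><hr id="what" class="system-pagebreak" style="display:block" /><p>two</p>';

$found = preg_match($pattern, $html, $m);

var_dump($found); // int(1): the pattern matched
var_dump($m[0]);  // the whole <hr ... /> tag
var_dump($m[1]);  // text between "<hr" and "class": ' id="what" '
var_dump($m[2]);  // text between the closing quote and "/>": ' style="display:block" '
```

Index 0 of the match array is always the full match; the numbered entries follow the order of the parentheses in the pattern.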

Is there a tool to convert these regular expressions to normal words?

Not really. What do you mean by "normal words"? That's a very straightforward regex, and you can't "convert" it to anything else without losing its meaning. There are plenty of sites for testing regular expressions, though, such as Regex101.

*Note that I say "attempting" because this is a really bad way of interacting with (X)HTML, and it is sure to break eventually. You should use a DOM parser.
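A minimal sketch of the DOM-based alternative, assuming PHP's bundled DOM extension (DOMDocument and DOMXPath) is available. The input string and the XPath expression are illustrative choices, not something taken from the question.

```php
<?php
// Hypothetical input: note the extra class, which the regex would not accept.
$html = '<p>one</p><hr id="what" class="system-pagebreak extra" /><p>two</p>';

// Collect parser warnings internally instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select <hr> elements whose class list contains the token "system-pagebreak".
$query = '//hr[contains(concat(" ", normalize-space(@class), " "), " system-pagebreak ")]';
$nodes = $xpath->query($query);

$count = $nodes->length;                    // number of matching <hr> elements
$id = $nodes->item(0)->getAttribute('id');  // attribute of the first match

var_dump($count); // int(1)
var_dump($id);    // string(4) "what"
```

Unlike the regex, this does not depend on attribute order, spacing around the equals sign, or whether the tag is written as self-closing.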

And here comes the standard link to the age-old answer on parsing HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Vogel612, Feb 07 '14 at 13:05

Vogel612 · Accepted Answer · 2014-02-07T13:02:19.207

This regex matches any self-closing hr whose class attribute is exactly "system-pagebreak", with no additional classes.

The "actual" regex is the part between the # delimiters.
The iU after it is two "flags" that specify how the regex behaves: i makes the regex case-insensitive, and U makes the regex quantifiers lazy by default.
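The effect of the two flags can be sketched with a few preg_match calls. The subject strings are made up for illustration; the last pattern is an assumed equivalent that spells out the laziness with .*? instead of the U flag.

```php
<?php
// Hypothetical input with a second "/>" later on the same line.
$subject = '<hr class="system-pagebreak" /><img src="a.png" />';

// Without U, (.*) is greedy and runs on to the LAST "/>" on the line.
preg_match('#<hr(.*)class="system-pagebreak"(.*)\/>#i', $subject, $greedy);

// With U, (.*) is lazy and stops at the FIRST "/>".
preg_match('#<hr(.*)class="system-pagebreak"(.*)\/>#iU', $subject, $lazy);

// Same behaviour without U, using explicit lazy quantifiers.
preg_match('#<hr(.*?)class="system-pagebreak"(.*?)\/>#i', $subject, $explicit);

var_dump($greedy[0]);   // the whole subject string
var_dump($lazy[0]);     // only '<hr class="system-pagebreak" />'
var_dump($explicit[0]); // same as $lazy[0]

// The i flag applies to the literal parts of the pattern as well.
$upper = preg_match('#<hr(.*)class="system-pagebreak"(.*)\/>#iU', '<HR CLASS="SYSTEM-PAGEBREAK" />');
var_dump($upper);       // int(1)
```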

The first part of the regex (<hr) is evaluated as a string literal. Because of the i flag, it matches any of these combinations:

- <hr
- <Hr
- <hR
- <HR

Then follows a group (marked by the parentheses). Inside it, the special character . (any character except a newline) is repeated by * zero or more times; because of the U flag, it matches as few characters as possible.

Then follows a literal match for class="system-pagebreak". This will not match things like these:

class="system-pagebreak someclass"
class ="system-pagebreak"

After that comes any character again, repeated as few times as possible, and then a literal match for />. The backslash escapes the slash; since the delimiter here is # rather than /, the escape is not strictly required, but it is harmless.
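The matching and non-matching cases described above can be checked with a short sketch. The sample strings are assumptions chosen to mirror those cases, and the second pattern is only there to compare the escaped and unescaped slash.

```php
<?php
$pattern = '#<hr(.*)class="system-pagebreak"(.*)\/>#iU';

$samples = [
    '<HR class="system-pagebreak" />',           // matches: i ignores case
    '<hr class="system-pagebreak someclass" />', // no match: extra class before the closing quote
    '<hr class ="system-pagebreak" />',          // no match: space before "="
    '<hr class="system-pagebreak">',             // no match: tag does not end in "/>"
];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 on a match and 0 when there is none.
    $results[] = preg_match($pattern, $sample);
}
print_r($results); // 1, 0, 0, 0

// With # as the delimiter, "/" needs no escaping: both spellings give the same result.
$unescaped = '#<hr(.*)class="system-pagebreak"(.*)/>#iU';
$same = preg_match($unescaped, $samples[0]) === preg_match($pattern, $samples[0]);
var_dump($same); // bool(true)
```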

score 0 · Answer 3 · answered Feb 07 '14 at 12:51

It will match <hr> tags with a class="system-pagebreak" attribute. It will also capture anything between hr and class, and between the second quotation mark and the end of the tag (/>). \/ is an escaped slash. i makes it case-insensitive and U makes it ungreedy. The pound (#) signs mark the beginning and end of the pattern.

score 0 · Answer 4 · answered Feb 07 '14 at 12:52

Is there a tool to convert these regular expressions to normal words?

You can use a tool like www.regexper.com to visualize the regex: http://www.regexper.com/#%23%3Chr(.)class%3D%22system-pagebreak%22(.)%5C%2F%3E%23 This helps with understanding it.

Can somebody please explain to me what this regex means?

There are already enough good answers :)

score 0 · Answer 5 · answered Feb 07 '14 at 12:56

This regex matches all characters on the same line after <hr until class="system-pagebreak" is met, and puts them in the first capturing group. Then it puts all the following characters (again on the same line) up to /> in capturing group 2.

The goal is probably to find self-closing hr tags that contain the class system-pagebreak. However, it's a bad pattern, since it will also match this kind of string:

<hr><div class="system-pagebreak"><img src="image.jpg" />
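This false positive can be reproduced with a short sketch. The stricter pattern below is a hypothetical tightening that swaps .* for [^>]* so a match cannot run past the end of the hr tag; it is an illustration only and still not a substitute for a real HTML parser.

```php
<?php
$pattern = '#<hr(.*)class="system-pagebreak"(.*)\/>#iU';
$subject = '<hr><div class="system-pagebreak"><img src="image.jpg" />';

// The original pattern matches, even though the class belongs to the <div>.
$loose = preg_match($pattern, $subject, $m);
var_dump($loose); // int(1)
var_dump($m[0]);  // the whole subject string
var_dump($m[1]);  // '><div ' : the first group has run past the end of <hr>

// Hypothetical tightening: [^>]* cannot cross a ">", so the match stays inside one tag.
$stricter = '#<hr[^>]*class="system-pagebreak"[^>]*/>#i';
$strict = preg_match($stricter, $subject);
$stillOk = preg_match($stricter, '<hr id="what" class="system-pagebreak" />');

var_dump($strict);  // int(0): the false positive is rejected
var_dump($stillOk); // int(1): a real page-break tag still matches
```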
