0

Can somebody please explain to me what this regex means?

#<hr(.*)class="system-pagebreak"(.*)\/>#iU

Is there a tool to convert these regular expressions to normal words?

Michael Berkowski
Rrezarta Muja
    This tool explains each symbol (token) very clearly: http://regex101.com/. You may have to separate the flags and delimiters first. – gskema Feb 07 '14 at 12:53

5 Answers

6

It is attempting* to match any <hr> tags that have class="system-pagebreak" attributes.

The (.*) segments, one between hr and class and one before the closing />, each match "zero or more characters", so it can match things like

<hr id="what" class="system-pagebreak" style="display:block" />

The i and U flags after the closing # delimiter make it case-insensitive (i) and ungreedy (U), so that each .* matches as little as possible instead of eating up everything up to the last /> it can reach.
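To see the two capture groups in action, here is a minimal sketch in Python rather than PHP. It rests on an assumption about the translation: Python's re module has no # delimiters and no U flag, so the delimiters are dropped, i becomes re.IGNORECASE, and each .* is written as the lazy .*? to imitate ungreedy mode. The names PAGEBREAK and sample are made up for illustration.

```python
import re

# Assumed Python equivalent of the PHP pattern #<hr(.*)class="system-pagebreak"(.*)\/>#iU:
# - the # delimiters are dropped,
# - the i flag becomes re.IGNORECASE,
# - the U flag is imitated by writing each .* as the lazy .*?
PAGEBREAK = re.compile(r'<hr(.*?)class="system-pagebreak"(.*?)/>', re.IGNORECASE)

sample = '<p>one</p><HR id="what" class="system-pagebreak" style="display:block" /><p>two</p>'
match = PAGEBREAK.search(sample)

whole = match.group(0)         # the complete tag that was matched
before_class = match.group(1)  # everything between "<hr" and "class=..."
after_class = match.group(2)   # everything between the class attribute and "/>"

print(repr(whole))
print(repr(before_class))
print(repr(after_class))
```

Because the quantifiers are lazy, the match stops at the first /> after the class attribute and does not run on into the following paragraph.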

Is there a tool to convert these regular expressions to normal words?

Not really. What do you mean by "normal words"? That's a very straightforward regex, and you can't "convert" it to anything else without losing its meaning. There are plenty of sites for testing regular expressions though, such as Regex101.

*Note that I say attempting because this is a really bad way of attempting to interact with (X)HTML, and is sure to break eventually. You should use a DOM-parser.
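As a rough illustration of the parser-based approach, the sketch below uses Python's standard-library html.parser. This is an assumption on my part: it is an event-based parser rather than a full DOM, and the original code is presumably PHP, where a DOM library would be the natural choice. The class name PagebreakFinder and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class PagebreakFinder(HTMLParser):
    """Collects the attributes of every <hr> whose class list contains system-pagebreak."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <HR> arrives as "hr".
        # Self-closing tags such as <hr /> also end up here by default.
        if tag != "hr":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "system-pagebreak" in classes:
            self.found.append(attributes)


finder = PagebreakFinder()
finder.feed('<hr><div class="system-pagebreak"></div><HR class="system-pagebreak extra" id="x" />')
print(finder.found)
```

Unlike the regex, this only looks at attributes that belong to the hr element itself, and it does not care about attribute order, extra classes, or upper-case tag names.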

user229044
    And here comes the standard link to the age-old answer on parsing HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Vogel612 Feb 07 '14 at 13:05
1

This regex matches any self-closing hr with class "system-pagebreak", but not with additional classes.

The "actual" regex is the part between the # delimiters.
The iU after it is two "flags" that specify how the regex behaves. The i means that the regex is case-insensitive; the U means that the regex quantifiers are lazy by default.

The first part of the regex (<hr) is evaluated as a string literal. Because of the i flag, it matches any of these combinations:

- <hr
- <Hr
- <hR
- <HR

Then follows a capturing group (marked by the parentheses). Inside it, the special character . matches any character except a newline, and * repeats it zero or more times; because of the U flag, it matches as few characters as possible.

Then follows a literal match for class="system-pagebreak". This will not match things like these:

  • class="system-pagebreak someclass"
  • class ="system-pagebreak"

After that comes another (.*) group, followed by a literal match for />. The backslash escapes the slash. With # as the delimiter the slash is not special, so the escape is not strictly needed here, but it is harmless.
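The consequences of the literal parts can be checked quickly. The snippet below is a sketch in Python, assuming the same translation of the PHP pattern as elsewhere (re.IGNORECASE for i, lazy .*? for U); the candidate strings are invented for illustration.

```python
import re

# Assumed Python translation of #<hr(.*)class="system-pagebreak"(.*)\/>#iU
PAGEBREAK = re.compile(r'<hr(.*?)class="system-pagebreak"(.*?)/>', re.IGNORECASE)

candidates = [
    '<hr class="system-pagebreak" />',            # plain case
    '<HR CLASS="SYSTEM-PAGEBREAK"/>',             # i flag applies to the whole pattern
    '<hr class="system-pagebreak someclass" />',  # extra class breaks the literal
    '<hr class ="system-pagebreak" />',           # space before = breaks the literal
    '<hr class="system-pagebreak">',              # no closing />
]

# Map each candidate to whether the pattern finds a match in it.
results = {tag: PAGEBREAK.search(tag) is not None for tag in candidates}

for tag, matched in results.items():
    print(f"{matched!s:5}  {tag}")
```

Only the first two candidates match; the other three fail because the literal class="system-pagebreak" or the literal /> is not present exactly as written.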

Vogel612
0

It will match <hr> tags with a class="system-pagebreak" attribute. It will also capture anything between hr and class, and anything between the second quotation mark and the end of the tag (/>). The backslash in \/ escapes the slash. The i flag makes it case-insensitive and the U flag makes it ungreedy. The pound (#) signs mark the beginning and end of the pattern.

Ilion
0

Is there a tool to convert these regular expressions to normal words?

You can use a tool like www.regexper.com to visualize the regex: http://www.regexper.com/#%23%3Chr(.)class%3D%22system-pagebreak%22(.)%5C%2F%3E%23 This helps with understanding it.

Can somebody please explain to me what this regex means?

There are already enough good answers :)

Reeno
0

This regex matches all characters on the same line after <hr until class="system-pagebreak" is reached, and puts them in the first capturing group. It then puts all following characters (still on the same line) up to /> in the second capturing group.
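The "same line" restriction comes from the dot, which does not match a newline unless a dot-all flag is set (s in PCRE, re.DOTALL in Python). A small sketch, again assuming a Python translation of the pattern with re.IGNORECASE and lazy quantifiers standing in for the i and U flags:

```python
import re

# Assumed Python translation of #<hr(.*)class="system-pagebreak"(.*)\/>#iU
PATTERN = r'<hr(.*?)class="system-pagebreak"(.*?)/>'

one_line = '<hr id="a" class="system-pagebreak" />'
two_lines = '<hr id="a"\n    class="system-pagebreak" />'

# Without a dot-all flag, "." cannot cross the newline inside the tag.
same_line_match = re.search(PATTERN, one_line, re.IGNORECASE) is not None
split_line_match = re.search(PATTERN, two_lines, re.IGNORECASE) is not None

# With re.DOTALL the dot also matches "\n", so the split tag is found.
dotall_match = re.search(PATTERN, two_lines, re.IGNORECASE | re.DOTALL) is not None

print(same_line_match, split_line_match, dotall_match)
```

So a tag whose attributes are wrapped over several lines is silently missed by the original pattern.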

The goal is probably to find self-closing hr tags that contain the class system-pagebreak. However, it's a bad pattern, since it will also match this kind of string:

<hr><div class="system-pagebreak"><img src="image.jpg" /> 
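The false positive can be reproduced with a short Python sketch. The LOOSE pattern is an assumed translation of the original (re.IGNORECASE for i, lazy .*? for U). The TIGHTER pattern is not from this answer; it is one possible mitigation, replacing . with [^>] so that a group cannot run past the end of the hr tag. It is still a regex and shares the other limitations discussed on this page.

```python
import re

# Assumed Python translation of the original pattern.
LOOSE = re.compile(r'<hr(.*?)class="system-pagebreak"(.*?)/>', re.IGNORECASE)

# Hypothetical tightened variant: [^>] keeps each group inside a single tag.
TIGHTER = re.compile(r'<hr\b([^>]*?)class="system-pagebreak"([^>]*?)/>', re.IGNORECASE)

false_positive = '<hr><div class="system-pagebreak"><img src="image.jpg" />'
real_tag = '<hr id="what" class="system-pagebreak" style="display:block" />'

loose_hit = LOOSE.search(false_positive)
tighter_hit = TIGHTER.search(false_positive)
tighter_real = TIGHTER.search(real_tag)

print(loose_hit.group(1) if loose_hit else None)  # first group spills into the <div>
print(tighter_hit)                                # no match across tag boundaries
print(tighter_real.group(0) if tighter_real else None)
```

With the loose pattern, the first group swallows the end of the hr tag and the start of the div, which is exactly the problem described above.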
Casimir et Hippolyte