-1

Good day!

I am trying to write a somewhat more difficult regex, but without success :( I am trying to match HTML starting from

<div class="about">

and then count the closing

</div>

tags, so that I match everything in between.

I wrote a regex, but it is not working. I guess I am missing something that allows the counted instances to have anything in between them. I tried to google it, but the might of regex is obviously tough for newbies.

<div class="about">[\s\S]*(<\/div>){2} 

Help and advice appreciated.

  • Regex is not the correct tool for this task. You will probably want to use an HTML parser. See [this relevant answer](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Aaron Jun 29 '17 at 15:14
  • **Don't use Regex to match HTML** – Tom Lord Jun 29 '17 at 15:17
  • You can, but it's a pain in the a**. – revo Jun 29 '17 at 15:18
  • Your regex assumes the HTML contains `</div></div>`. What if there are spaces in between? Or other HTML elements? Use a DOM parser, not regex. – Tom Lord Jun 29 '17 at 15:19
  • @TomLord This is exactly the case that is making my head swell. Can I use http://simplehtmldom.sourceforge.net/ for this? If yes, how do I call it from a WordPress template? Can I just include it, or is it something more difficult? OK, this post is heading towards a different topic. – user2047710 Jun 29 '17 at 16:18
  • @Aaron note taken! Lesson learned. – user2047710 Jun 29 '17 at 16:19
  • @user2047710 Yes. Use that. Or another DOM parser library. You'll need to do something like `$html->find('div.about')` (that library takes CSS-style selectors). Have a go, and ask for help if needed, but stay away from regex for "complex" HTML parsing like this :) – Tom Lord Jun 29 '17 at 16:31

2 Answers

0

As others have said, you should avoid regexes in many cases where there exists a better parser (whether HTML, CSS, CSV, or whatever) that works for your use-case.

The reason for this is that the data may be tree structured, and might have some of the things you're looking for within other elements; for example, within <!-- --> comments. And then you have to exclude those. Which means recognizing when a comment really is a comment, and it rapidly becomes a mess.

But there are use-cases where such a parser is overkill. If you want a quick guesstimate from a one-off command-line call, rather than a script you'll be using forever and sharing with others, regexes can still be your friend.

Something like this:

<div class="about">([\s\S]*?<\/div>)*

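This will match not only the divs within the "about" div, but everything up to the last closing div tag in the remainder of the page, whether it's commented out or not (along with any separating tags, whitespace, and other content). Note that in most regex engines the capture group itself retains only its final repetition; the overall match is what spans the whole range. If yours is a simple enough case that this is all you want, then that's fine.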
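To make that behaviour concrete, here is a minimal sketch in Python (chosen only for illustration; the pattern syntax is the same in PCRE). The sample page in `html` is hypothetical, not taken from the question.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r'<div class="about">([\s\S]*?<\/div>)*')

# Hypothetical page: the "about" div, unrelated markup, a commented-out div,
# and a footer div.
html = (
    '<div class="about"><div>inner</div></div>'
    '<p>unrelated</p>'
    '<!-- <div>commented out</div> -->'
    '<div class="footer">footer</div>'
)

match = PATTERN.search(html)

# The overall match runs from the opening "about" div to the LAST </div>
# on the page, including the one inside the comment.
print(match.group(0) == html)

# The capture group only keeps its final repetition.
print(match.group(1))
```

On this sample the overall match is the entire string, which is the over-matching described above.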

But if you want anything complex, then you'll rapidly venture into recursive regexes, with conditionals, and then the pain starts; the DOM tree parser will become the better option, long before you reach that point.
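As a rough illustration of the parser route, the sketch below uses Python's standard-library `html.parser` and tracks nesting depth instead of counting closing tags. The class name `AboutDivExtractor` and the sample markup are hypothetical, and this is only one way to do it under the assumption that the first `<div class="about">` is the one wanted; a full DOM library would do the same job with a selector.

```python
from html.parser import HTMLParser


class AboutDivExtractor(HTMLParser):
    """Collects text inside the first <div class="about">, tracking div nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # 0 means we are outside the target div
        self.done = False   # set once the target div has been closed
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target
        elif "about" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.text.append(data)


# Hypothetical sample: a nested div and a comment containing a stray </div>.
html = (
    '<div class="about">Hello <div>nested <!-- </div> --> text</div> world</div>'
    '<div>other</div>'
)

extractor = AboutDivExtractor()
extractor.feed(html)
extractor.close()

# Normalise whitespace before printing.
result = " ".join("".join(extractor.text).split())
print(result)
```

The `</div>` inside the comment is reported to the parser as comment content rather than as an end tag, so it does not disturb the depth count, and no fixed number of closing tags has to be known in advance.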

Dewi Morgan
0

First, thanks to everybody for sharing their time and knowledge. With your help, I got the job done with

<div class="about">([\s\S]*?<\/div>){6}

{6} is the count of closing div tags. More importantly, you gave me the clue that this will only work until the HTML page structure changes, and that to make it permanent I should use a DOM parser.
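The fragility of a fixed count can be seen in a small Python sketch (Python is used only for illustration). The helper `about_block` and the three sample pages are hypothetical; they assume an "about" div that originally contains five nested divs, giving six closing tags.

```python
import re

# The fixed-count pattern from this answer, unchanged.
FIXED_COUNT = re.compile(r'<div class="about">([\s\S]*?<\/div>){6}')


def about_block(html):
    """Return the matched text, or None if the pattern does not match."""
    m = FIXED_COUNT.search(html)
    return m.group(0) if m else None


# Hypothetical page: five nested divs plus the closing tag of "about" itself.
original = '<div class="about">' + '<div>x</div>' * 5 + '</div><div>next</div>'

# The same page after one nested div is removed.
changed = '<div class="about">' + '<div>x</div>' * 4 + '</div><div>next</div>'

# The same page after two nested divs are removed.
shrunk = '<div class="about">' + '<div>x</div>' * 3 + '</div><div>next</div>'

print(about_block(original))  # stops at the end of the "about" div
print(about_block(changed))   # silently runs on into the following div
print(about_block(shrunk))    # fewer than six closing tags on the page: None
```

With one nested div removed, the pattern still matches but swallows the following `<div>next</div>`; with two removed, it fails outright. Either way the count has to be updated by hand whenever the structure changes.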