Incorrect regex for divs

Question

I'm trying to extract divs from many of my website's files using regexes, but I'm failing.
This is what I'm trying to do: http://regexr.com/38to9

I need the regex to match the divs with class data shown below, as well as divs with classes plainText and extData, and to capture everything inside them. There are no other divs nested inside the ones I listed.
I've been sitting on this for around 2 hours now and I can't figure it out.
Here is the input, for anyone who doesn't want to visit that cool site:

<div class="data">
    Something
</div>

<div class="data">
     Text in here
    <a class="data" href="links"><img src="whatever.png"></a>
</div>

With regex

\s*<div class="(data|plainText|extData)">\s*(...)\s*<\/div>

The first div is highlighted, the second one isn't, and I don't get any results with preg_match_all in PHP either. Does it have anything to do with the fact that I'm using tabs in the second div but not in the first one?
(I wrote it quickly on the website to see if it works.)

[**THE PONY HE COMES**](http://stackoverflow.com/a/1732454/507674) — Niet the Dark Absol, May 29 '14 at 10:32
Also... `(...)` means "match three characters". It works for the first one because (in the Regexr code) you have exactly three characters between the spaces... — Niet the Dark Absol, May 29 '14 at 10:34
I was actually using (.*?) before and it worked just fine, soo. Anyway, I'll try parsing it according to your example instead, see how that works out — P.K., May 29 '14 at 10:40
`(.*?)` wouldn't work either because your second `<div>` has a newline in its content, which `.` doesn't match without the appropriate modifier. — Niet the Dark Absol, May 29 '14 at 10:43
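
To illustrate the point about `(...)` from the comments above, here is a minimal sketch. The two input strings are hypothetical (they are not the Regexr contents), and the pattern is a simplified version of the one in the question:

```php
<?php
// Simplified pattern from the question: (...) is exactly three characters.
$pattern = '~<div class="data">\s*(...)\s*</div>~';

// Hypothetical inputs, made up for this illustration.
$threeChars = "<div class=\"data\">\n    abc\n</div>";
$nineChars  = "<div class=\"data\">\n    Something\n</div>";

// preg_match returns 1 on a match and 0 when nothing matches.
$matchesThree = preg_match($pattern, $threeChars);
$matchesNine  = preg_match($pattern, $nineChars);

var_dump($matchesThree); // int(1): "abc" is exactly three characters
var_dump($matchesNine);  // int(0): "Something" is longer than three characters
```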

score 2 · Answer 1 · answered May 29 '14 at 10:38

Have you tried using a parser instead?

$dom = new DOMDocument();
$dom->loadHTML($input);
$divs = $dom->getElementsByTagName('div');
foreach($divs as $div) {
  if( preg_match("/\b(data|plainText|extData)\b/",$div->getAttribute("class")) ) {
    // do something to the $div
    $div->setAttribute("title","I matched!");
  }
}
$out = $dom->saveHTML();

// Because DOMDocument wraps our HTML in a minimal document, we need to extract
// the <body> contents. In this case, regex is okay because we have a known structure.
// The "s" modifier lets "." match the newlines inside the body:
$out = preg_replace('~.*?<body>(.*)</body>.*~s', '$1', $out);
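
If the goal is to collect the contents of the matching divs rather than modify them, the same parser approach might be sketched as follows. This is only one possible variant, not part of the answer above: the `$wanted` and `$contents` names are made up for the example, and the class check splits the attribute on whitespace instead of using a regex.

```php
<?php
// Sample input copied from the question.
$input = <<<HTML
<div class="data">
    Something
</div>
<div class="data">
     Text in here
    <a class="data" href="links"><img src="whatever.png"></a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($input);
$xpath = new DOMXPath($dom);

$wanted = ['data', 'plainText', 'extData']; // classes named in the question
$contents = [];                             // hypothetical result array

foreach ($xpath->query('//div[@class]') as $div) {
    // Split the class attribute into individual class names.
    $classes = preg_split('/\s+/', trim($div->getAttribute('class')));
    if (!array_intersect($wanted, $classes)) {
        continue; // none of the wanted classes is present
    }
    // Serialize each child node to rebuild the div's inner HTML.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $contents[] = trim($inner);
}

print_r($contents);
```

Because the divs are located through the DOM, newlines and nested tags inside them do not need any special handling.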

Mmm sorry didn't mean to steal the "correct answer" from you, let me +1 you which I should have done anyway. :) — zx81, May 29 '14 at 19:45

score 1 · Accepted Answer · answered May 29 '14 at 11:09

You have a great non-regex answer, but you should also know that you were really close...

With all disclaimers about parsing html with regex, adding the DOTALL modifier (?s) to your original expression matches what you want:

(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*<\/div>

See demo.

How does this work?

The DOTALL modifier (?s) tells the engine that a dot can match a newline character. This is important for your (.*?) because the content of the divs can span several lines.
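
As a rough check of the explanation above, the following sketch runs the expression with and without `(?s)` against the two divs from the question using preg_match_all. The variable names are made up for the example, and `~` is used as the delimiter so the slash in `</div>` needs no escaping.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<div class="data">
    Something
</div>

<div class="data">
     Text in here
    <a class="data" href="links"><img src="whatever.png"></a>
</div>
HTML;

// Same expression twice: once without and once with the DOTALL modifier.
$withoutDotall = '~<div class="(data|plainText|extData)">\s*(.*?)\s*</div>~';
$withDotall    = '~(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*</div>~';

$countWithout = preg_match_all($withoutDotall, $html, $m1);
$countWith    = preg_match_all($withDotall, $html, $m2);

var_dump($countWithout); // int(1): only the single-line div matches
var_dump($countWith);    // int(2): the multi-line div matches as well
print_r($m2[2]);         // captured contents of both divs
```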

answered May 29 '14 at 11:09

zx81


Haha, wow, thanks for letting me know. Maybe I'll pick this way of doing things, since I've been reading about how regexes are better than DOM modifications. Thanks – P.K. May 29 '14 at 12:00
@P.K. Erm, no, DOM modifications are better than regex XD Except in very specific circumstances, such as in my answer where I parse out the `<body>` contents. – Niet the Dark Absol May 29 '14 at 12:05
