
I have this HTML string:

<div class="sliderImg">
    <img width="1000" height="666" src="/consultants/images/projectbank/simansi-vaseon.jpg">
    <img width="1000" height="666" src="/consultants/images/projectbank/oloklirosi-parkou.jpg">
    <img width="1000" height="666" src="/consultants/images/projectbank/inverters.jpg">
</div>

<div class="projectProperties">
    <ul>
        <li class="projCategory">Project category: <span class="text">Energy</span></li>
        <li class="projEntity">Entity: <span class="text">Bright Wind and Solar</span></li>
        <li class="projRegion">Region: <span class="text">Southwest</span></li>
        <li class="projYear">Year: <span class="text">2010</span></li>
        <li class="projStatus">Status: <span class="text">Complete</span></li>
        <li class="projContribution">Contribution: <span class="text">Study and construction</span></li>
    </ul>
</div>

<div class="projectDesc">
    <p>Duis lectus arcu, auctor scelerisque diam a, hendrerit sagittis risus. Donec eget urna metus. Nulla sapien felis, vehicula vel convallis et, facilisis a nunc. Donec ac diam ut nisl rutrum convallis. Phasellus pellentesque turpis sit nullam.</p>
</div>

I would like to keep only the last div, the one with class `projectDesc`, using `preg_replace` and a regex, so that the result is:

<div class="projectDesc">
    <p>Duis lectus arcu, auctor scelerisque diam a, hendrerit sagittis risus. Donec eget urna metus. Nulla sapien felis, vehicula vel convallis et, facilisis a nunc. Donec ac diam ut nisl rutrum convallis. Phasellus pellentesque turpis sit nullam.</p>
</div>

I searched through many posts on SO but couldn't find anything about what sort of regex I should use. Can you please point me in the right direction, if this is even possible using only `preg_replace` and a regex?

otinanai

3 Answers


You want to extract the final div from that string of HTML? First off, do not use regex. Using regex on HTML or XML is a recipe for an increased bill at the pharmacy to deal with the headaches that are the inevitable consequence. (And you still won't have built a stable and reliable way of processing the HTML.)

The best solution is to use the PHP feature designed for processing HTML/XML: DOMDocument.

Now, your HTML document as you have submitted it is actually illegal, because it has multiple root elements. So I'm going to wrap it in another tag so that it can be manipulated.

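// Parse the fragment; the <body> wrapper gives it a single root element.
$dom = new DOMDocument;
$dom->loadHTML('<body>' . $html . '</body>');

// Select every div whose class attribute is exactly "projectDesc".
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//div[@class="projectDesc"]');

// Serialize the first match (the only one in this markup) back to HTML.
$output = $dom->saveHTML($elements->item(0));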
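
For reference, here is a self-contained sketch of the same DOMDocument and DOMXPath approach. The function name `extractLastDivByClass` and the sample markup are hypothetical, not part of the answer. It returns the last matching div rather than the first, and it prepends an XML encoding declaration, a commonly used workaround (treated here as an assumption) for getting `loadHTML` to read the input as UTF-8.

```php
<?php
// Hypothetical helper (not from the answer): returns the outer HTML of the
// last <div> carrying the given class, or null if there is none.
function extractLastDivByClass(string $html, string $class): ?string
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: prefixing an XML encoding declaration is a common
    // workaround to make loadHTML() treat the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word, so class="projectDesc wide" also works.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize only the last matching node, including its own <div> tag.
    return $dom->saveHTML($nodes->item($nodes->length - 1));
}

$html = '<div class="sliderImg"><img src="a.jpg"></div>'
      . '<div class="projectDesc"><p>Lorem ipsum.</p></div>';

$output = extractLastDivByClass($html, 'projectDesc');
echo $output, PHP_EOL;
```

Passing the node to `saveHTML` serializes just that element, so the rest of the parsed document never reaches the output.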
lonesomeday
  • ...although my HTML is NOT illegal. It's a regular HTML string coming from a CMS. The parent wrap was not necessary. – otinanai Nov 06 '13 at 15:58
  • @otinanai How interesting: you're right. It's not a valid HTML document (though of course it is valid HTML), and I thought DOMDocument wouldn't cope with it. You're right that it does. – lonesomeday Nov 06 '13 at 16:07
  • When it comes to Greek characters, change `$dom->saveHTML($elements->item(0))` to `utf8_decode($dom->saveHTML($elements->item(0)))` – otinanai Nov 12 '13 at 22:56

DO NOT use regular expressions to parse HTML

You want to use PHP Simple HTML DOM.

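// Requires the third-party PHP Simple HTML DOM library (simple_html_dom.php).
include 'simple_html_dom.php';

$string = "your HTML block that you posted.";

$html = str_get_html($string);
// outertext keeps the wrapping <div>; innertext would return only its contents.
$desc = $html->find('div[class=projectDesc]', 0)->outertext;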
David
  • Why not use regular expressions? – otinanai Nov 06 '13 at 14:58
  • I like [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) answer better :) – Kemal Fadillah Nov 06 '13 at 15:02
  • But how do DOM parsers parse HTML? – TiMESPLiNTER Nov 06 '13 at 15:05
  • Using REGEX, but they have coded for the hundreds of different exceptions and possibilities with HTML and abstract all the functions that you don't want to code. – David Nov 06 '13 at 15:07
  • @TiMESPLiNTER It's not with regex. Parsing is done with tokenising instead. If you're really interested (NB it's quite boring) you can read about HTML5 parsing in [the W3C spec](http://www.w3.org/html/wg/drafts/html/master/syntax.html#parsing). – lonesomeday Nov 06 '13 at 15:15
  • But you can also tokenize with regex, can't you? – TiMESPLiNTER Nov 06 '13 at 15:23

This regex will match the div you are looking for:

/(<div class="projectDesc"\>.*?<\/div>)/ims
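
As a rough sketch of how this pattern could be used, the snippet below applies it with `preg_match` and with the `preg_replace` call the question asks for. The variable names and sample strings are made up for illustration, and the pattern is written without the redundant `\>` escape, outer group, and `i`/`m` flags. The last case shows the main limitation: the lazy `.*?` stops at the first `</div>`, so a div nested inside `projectDesc` truncates the match.

```php
<?php
// Simplified form of the pattern above; the s flag lets . match newlines.
$pattern = '/<div class="projectDesc">.*?<\/div>/s';

// Sample markup (illustrative only) without nested divs.
$flat = '<div class="sliderImg"><img src="a.jpg"></div>'
      . '<div class="projectDesc"><p>Lorem ipsum.</p></div>';

// Extract the matching div.
preg_match($pattern, $flat, $m);
echo $m[0], PHP_EOL;
// prints: <div class="projectDesc"><p>Lorem ipsum.</p></div>

// The same idea with preg_replace: capture the div, drop everything else.
$kept = preg_replace(
    '/^.*?(<div class="projectDesc">.*?<\/div>).*$/s',
    '$1',
    $flat
);
echo $kept, PHP_EOL;
// prints: <div class="projectDesc"><p>Lorem ipsum.</p></div>

// Limitation: a nested div ends the lazy match too early.
$nested = '<div class="projectDesc"><div>inner</div><p>tail</p></div>';
preg_match($pattern, $nested, $n);
echo $n[0], PHP_EOL;
// prints: <div class="projectDesc"><div>inner</div>
```

The pattern also depends on the opening tag being written exactly as `<div class="projectDesc">`; an extra attribute or a different quote style would prevent a match.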
TiMESPLiNTER