Obviously, straightforward string cutting is not suitable for your second image:
...
<figure>
<img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />
<figcaption></figcaption>
</figure>
Cutting after the image would leave unclosed elements:
...
<figure>
<img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />
That could break the rendering of the page in the browser. It makes no difference whether you use preg_match
with a regular expression here or plain string functions: both cut the markup without knowing which elements are still open.
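To make the problem concrete, here is a minimal sketch of such a naive cut. The input string and the variable names are made up for illustration; only strpos, substr and substr_count are used.

```php
<?php
// Hypothetical input, modelled on the markup above.
$html = '<p>Intro</p><figure><img src="a.jpg" alt=""><figcaption>Caption</figcaption></figure><p>More</p>';

// Naive approach: find the end of the first <img ...> tag and cut right there.
$imgStart = strpos($html, '<img');
$imgEnd   = strpos($html, '>', $imgStart) + 1;
$cut      = substr($html, 0, $imgEnd);

echo $cut, "\n";

// Count opening and closing <figure> tags in the fragment that is left.
$opened = substr_count($cut, '<figure');
$closed = substr_count($cut, '</figure>');

printf("figure opened: %d, closed: %d\n", $opened, $closed);
```

The fragment ends right after the image, so the <figure> that was opened before it is never closed.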
What you need is a DOM parser like DOMDocument
that is able to process the HTML.
Given some sample HTML code similar to the one in your question:
$html = <<<HTML
dolor sit amet, consectetuer adipiscing elit. <img src="http://example.com/img-a.jpg"> Aenean commodo
ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes,
nascetur ridiculus mus.
<figure>
<img src="http://example.com/img-b.jpg">
<figcaption>Figure Caption</figcaption>
</figure>
Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut.
HTML;
You can now use the DOMDocument class to load the HTML chunk inside a <body> tag, because it is your whole HTML body for the manipulation. As you use HTML5 tags (<figure> & <figcaption>) that the underlying libxml HTML parser may not recognize, you should suppress the warnings about them when loading the string by calling libxml_use_internal_errors:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML(sprintf('<body>%s</body>', $html));
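If you want to see what the parser complained about instead of silently discarding it, the buffered errors can be inspected and then cleared. This is only a sketch with a shortened, made-up input; whether any errors are reported for <figure> depends on the libxml version in use.

```php
<?php
// Shortened sample input (an assumption for this sketch).
$html = '<figure><img src="http://example.com/img-b.jpg"><figcaption>Figure Caption</figcaption></figure>';

$doc = new DOMDocument();

// Collect parse errors internally instead of emitting PHP warnings.
// The function returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML(sprintf('<body>%s</body>', $html));

// Print whatever the parser buffered (possibly nothing).
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                  // empty the error buffer
libxml_use_internal_errors($previous);  // restore the previous setting

$imageCount = $doc->getElementsByTagName('img')->length;
```

Restoring the previous setting keeps the change local, which matters if other code in the same request relies on libxml warnings.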
This is the basic setup of the DOM parser; your HTML is now inside the parser. Now comes the interesting part. You want to create the excerpt up to the second image of the document. That means everything after that element should be removed. This sounds as easy as cutting a string, which we know does not work, but this time the DOM parser does all the work for us.
You only need to obtain all nodes (<tag>, Text, <!-- comments -->, ...) that come after the second <img> tag in document order and delete them. Such things can be expressed with XPath:
/descendant::img[position()=2]/following::node()
PHP's DOM parser comes with XPath, so let's do it:
$xp = new DOMXPath($doc);
$delete = $xp->query('/descendant::img[position()=2]/following::node()');
foreach ($delete as $node)
{
    $node->parentNode->removeChild($node);
}
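One detail of the XPath expression is worth a quick check: /descendant::img[position()=2] is not the same as the shorter-looking //img[2]. The latter selects an <img> that is the second <img> child of its own parent, so it finds nothing when the images live in different elements. The sketch below uses a made-up two-image document to show the difference.

```php
<?php
// Hypothetical document: two images, each inside a different parent element.
$html = '<p><img src="a.jpg"></p><figure><img src="b.jpg"></figure>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML(sprintf('<body>%s</body>', $html));
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xp = new DOMXPath($doc);

// Second <img> of the whole document, counted in document order.
$byDocumentOrder = $xp->query('/descendant::img[position()=2]');

// Second <img> child of any single parent: no parent here has two images.
$bySiblingPosition = $xp->query('//img[2]');

printf("descendant axis: %d match(es)\n", $byDocumentOrder->length);
printf("//img[2]: %d match(es)\n", $bySiblingPosition->length);
```

Wrapping the path in parentheses, as in (//img)[2], would also count across the whole document.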
The only thing left is to obtain the remaining excerpt (simply echoed here as an example). As we know, it is all inside the <body> tag:
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child)
{
    echo $doc->saveHTML($child);
}
Which will give you the following:
dolor sit amet, consectetuer adipiscing elit. <img src="http://example.com/img-a.jpg"> Aenean commodo
ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes,
nascetur ridiculus mus.
<figure><img src="http://example.com/img-b.jpg"></figure>
As this example shows, the <figure> tag is now properly closed.
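The steps above can be collected into one function if you need the excerpt in several places. This is a sketch, not a finished implementation: the function name excerptUntilImage and its $n parameter are hypothetical, and it assumes the input is a body fragment like the sample. If the fragment contains fewer than $n images, the query matches nothing and the HTML comes back uncut.

```php
<?php
/**
 * Hypothetical helper: returns $html with everything after its $n-th <img> removed.
 */
function excerptUntilImage(string $html, int $n = 2): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(sprintf('<body>%s</body>', $html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // All nodes that follow the $n-th image in document order.
    $xp = new DOMXPath($doc);
    $query = sprintf('/descendant::img[position()=%d]/following::node()', $n);
    foreach ($xp->query($query) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize the children of <body>, not the wrapper itself.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$sample = '<p>a <img src="1.jpg"> b</p>'
        . '<figure><img src="2.jpg"><figcaption>c</figcaption></figure>'
        . '<p>tail</p>';

$excerpt = excerptUntilImage($sample);
echo $excerpt, "\n";
```

Passing the image index as a parameter keeps the XPath expression in one place, so an excerpt up to the first or third image needs no further changes.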
A similar scenario is creating an excerpt after a specific text length or word count: Wordwrap / Cut Text in HTML string
maybe, but nothing so messed up as you say
– Damien Pirsy Feb 24 '12 at 19:40