I have a blog entry that sometimes contains a lot of text and images, and I want to cut an excerpt from it. More specifically, I want to match everything up to and including the second image tag.

Below is some sample text.

I've tried a negative lookahead like

/[\w\r\n;:',."&\s*<>=-_]+(?!<img)/i

but I can't figure out how to make the lookahead work together with the '+' quantifier. If anyone has a clue, I'd be really grateful.

*override*
I've been stuck in a room lately, and though it's hard to stay creative all the time,         sometimes you need that extra kick. Well for some us we have to throw pictures of true creative genius at ourselves to stimulate us.

So sit back and soak in some inspiration I've come across the past year.

&nbsp;

&nbsp;

&nbsp;

<figure>
    <a href="">
    <img class="aligncenter" src="http://funnypagenet.com/wp-content/uploads/2011/07/Talesandminimalism_12_www.funnypagenet.com_.jpg" alt="" width="574" height="838" />
    </a>
    <figcaption></figcaption>
</figure>

&nbsp;

&nbsp;

&nbsp;

&nbsp;
<h4 style="text-align: center;">
    <a href="http://funnypagenet.com/tales-and-minimalism/">source</a>
</h4>
Couldn't find who did this, but couldn't explain the movie any simpler

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

<figure>
    <img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />
    <figcaption></figcaption>
</figure>

&nbsp;

&nbsp;

&nbsp;
PeeHaa
Marius Miliunas

3 Answers


Obviously, straightforward string cutting is not suitable for your second image:

...
<figure>
    <img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />
    <figcaption></figcaption>
</figure>

Cutting after the image would leave unclosed elements:

...
<figure>
    <img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />

This could break the rendering of the page in the browser. It makes no difference whether you use preg_match with a regular expression or plain string functions here.

What you need is a DOM parser like DOMDocument that is able to process the HTML.

Given some sample HTML similar to that in your question:

$html = <<<HTML
dolor sit amet, consectetuer adipiscing elit. <img src="http://example.com/img-a.jpg"> Aenean commodo 
ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, 
nascetur ridiculus mus.

<figure>
    <img src="http://example.com/img-b.jpg">
    <figcaption>Figure Caption</figcaption>
</figure>

Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut.
HTML;

You can now use the DOMDocument class to load the HTML chunk wrapped in a <body> tag, because the chunk is the whole HTML body you want to manipulate. Because <figure> and <figcaption> are HTML5 elements that the underlying libxml HTML parser does not recognize, you should suppress the warnings about them when loading the string, using libxml_use_internal_errors:

$doc = new DOMDocument();
// collect parser warnings internally instead of emitting them
libxml_use_internal_errors(true);
// wrap the fragment so that it becomes the content of <body>
$doc->loadHTML(sprintf('<body>%s</body>', $html));

This is the basic setup of the DOM parser; your HTML is now loaded into it. Now comes the interesting part. You want the excerpt to run up to the second image in the document, which means everything after that element should be removed. That sounds just like cutting a string, which we know does not work, but this time the DOM parser does all the work for us.

You only need to select all nodes (<tag>, text, <!-- comments -->, ...) that follow the second <img> tag in document order and delete them. Such a selection can be expressed with XPath:

/descendant::img[position()=2]/following::node()

PHP's DOM parser comes with XPath, so let's do it:

$xp = new DOMXPath($doc);
$delete = $xp->query('/descendant::img[position()=2]/following::node()');
foreach ($delete as $node)
{
    $node->parentNode->removeChild($node);
}

The only thing left is to output the excerpt that remains (echoed here as an example). As we know, it is all inside the <body> tag:

foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child)
{
    echo $doc->saveHTML($child);
}

Which will give you the following:

dolor sit amet, consectetuer adipiscing elit. <img src="http://example.com/img-a.jpg"> Aenean commodo 
ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, 
nascetur ridiculus mus.

<figure><img src="http://example.com/img-b.jpg"></figure>

As this example shows, the <figure> tag is properly closed now.
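
The steps above can be combined into one function. The following is only a sketch: the function name excerptUntilImage and its $n parameter are made up for illustration and are not part of the DOM extension. It assumes the input is a body fragment like the sample above, and it returns the input unchanged (as re-serialized by the parser) when there are fewer than $n images.

```php
<?php
// Hypothetical helper: keep everything up to and including the $n-th <img>,
// letting the DOM serializer close any elements that are still open.
function excerptUntilImage(string $html, int $n = 2): string
{
    $doc = new DOMDocument();
    // Collect parser warnings (e.g. about <figure>) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(sprintf('<body>%s</body>', $html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every node that follows the $n-th <img> in document order.
    $xp = new DOMXPath($doc);
    $query = sprintf('/descendant::img[position()=%d]/following::node()', $n);
    foreach ($xp->query($query) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize what is left inside <body>.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling excerptUntilImage($html) with the sample $html from above should give the same excerpt as the step-by-step version.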

A similar scenario is creating an excerpt after a specific text length or word count: Wordwrap / Cut Text in HTML string

hakre

Well, it's not regex, but it should work:

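// put a split marker in front of every image tag
$post = str_ireplace('<img', '!!!<img', $post);
// split into at most three chunks; the third holds everything from the second image on
$parts = explode('!!!', $post, 3);
// keep the first two chunks (works even if the post has fewer than two images)
$keep = implode('', array_slice($parts, 0, 2));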

This puts a split marker (!!!) before each image tag, splits on the marker, and keeps the first two chunks, which is everything up to (but not including) the second image tag. No regex required.

Edit: Because this is for an excerpt, you might want to run strip_tags() on the result. If you don't, you may end up with opened HTML tags that never get closed.
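
As a rough sketch of that strip_tags() idea: the function name plainExcerpt is hypothetical, and the sketch assumes the post never contains the marker '!!!' itself. Note that strip_tags() removes all markup, so the first image is dropped as well and only text is left.

```php
<?php
// Hypothetical helper: plain-text excerpt of everything before the second <img>.
function plainExcerpt(string $post): string
{
    // Mark every image tag, then keep the two chunks before the second marker.
    $parts = explode('!!!', str_ireplace('<img', '!!!<img', $post), 3);
    $keep = implode('', array_slice($parts, 0, 2));

    // Drop all tags so that no element is left unclosed.
    $text = strip_tags($keep);
    // The sample post is full of &nbsp; spacers; turn them into plain spaces.
    $text = str_replace('&nbsp;', ' ', $text);
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```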

Mr. Llama

If you really want a regex-based solution, here it is:

// assuming $str is your full HTML text
if ( preg_match_all('~^(.*?<img\s.*?<img\s[^>]*>)~si', $str, $m) )
    print_r ( $m[1] );
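
To illustrate what this pattern captures, here is a small sketch that wraps it in a function (the name regexExcerpt is made up for this example). The lazy .*? runs up to the first <img, then up to the second one, and [^>]*> consumes the rest of that tag, so the match ends right after the second image. This assumes no attribute value contains a > character, and any element that is open at that point, such as <figure>, stays unclosed.

```php
<?php
// Hypothetical wrapper around the pattern above.
function regexExcerpt(string $str): ?string
{
    // s: the dot also matches newlines; i: case-insensitive tag names.
    if (preg_match('~^(.*?<img\s.*?<img\s[^>]*>)~si', $str, $m)) {
        return $m[1];
    }
    return null; // fewer than two <img> tags in the input
}
```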
anubhava