PHP preg_match everything until

Question

I have a blog entry that will sometimes contain a lot of text and images, and I want to cut an excerpt from it. To be more specific, I want to match everything up to and including the second image tag.

Below is some sample text.

I've tried a negative lookahead like

/[\w\r\n;:',."&\s*<>=-_]+(?!<img)/i

but I can't figure out how to make the lookahead apply to the '+' quantifier. If anyone has a clue, I'd be really grateful.

*override*
I've been stuck in a room lately, and though it's hard to stay creative all the time,         sometimes you need that extra kick. Well for some us we have to throw pictures of true creative genius at ourselves to stimulate us.

So sit back and soak in some inspiration I've come across the past year.

&nbsp;

&nbsp;

&nbsp;

<figure>
    <a href="">
    <img class="aligncenter" src="http://funnypagenet.com/wp-content/uploads/2011/07/Talesandminimalism_12_www.funnypagenet.com_.jpg" alt="" width="574" height="838" />
    </a>
    <figcaption></figcaption>
</figure>

&nbsp;

&nbsp;

&nbsp;

&nbsp;
<h4 style="text-align: center;">
    <a href="http://funnypagenet.com/tales-and-minimalism/">source</a>
</h4>
Couldn't find who did this, but couldn't explain the movie any simpler

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

<figure>
    <img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />
    <figcaption></figcaption>
</figure>

&nbsp;

&nbsp;

&nbsp;

Can be easily done using DOM, any reason why you prefer regex based solution? — anubhava, Feb 24 '12 at 19:33
@Shiplu Looks not that bad...just lacks some
maybe, but nothing so messed up as you say — Damien Pirsy, Feb 24 '12 at 19:40
@Damien Oops! i didnt inspect it thoroughly. The structure seems okay. Just too many *non breaking spaces* — Shiplu Mokaddim, Feb 24 '12 at 19:41
the reason I don't want to use the DOM is because I want to do as much as possible on the server end before I bring it to the front end — Marius Miliunas, Feb 25 '12 at 00:42

score 3 · Accepted Answer · edited May 23 '17 at 09:59

Obviously, straightforward string cutting is not suitable for your second image:

...
<figure>
    <img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />
    <figcaption></figcaption>
</figure>

Cutting after the image would leave unclosed elements:

...
<figure>
    <img class="aligncenter" src="http://brickhut.files.wordpress.com/2011/05/theempirestrikesback1.jpg" alt="" width="540" height="800" />

This could break the rendering of the page in the browser, and it makes no difference whether you cut with preg_match and a regular expression or with plain string functions.

What you need is a DOM parser like DOMDocument that is able to process the HTML.

Given some sample HTML that is similar to the one in your question:

$html = <<<HTML
dolor sit amet, consectetuer adipiscing elit. <img src="http://example.com/img-a.jpg"> Aenean commodo 
ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, 
nascetur ridiculus mus.

<figure>
    <img src="http://example.com/img-b.jpg">
    <figcaption>Figure Caption</figcaption>
</figure>

Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut.
HTML;

You can now use the DOMDocument class to load the HTML chunk, wrapped in a <body> tag because it represents your whole HTML body for the manipulation. <figure> and <figcaption> are HTML5 elements that the libxml HTML parser behind DOMDocument may not recognize, so you should suppress the resulting warnings with libxml_use_internal_errors when loading the string:

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$doc->loadHTML(sprintf('<body>%s</body>', $html));

This is the basic setup of the DOM parser; your HTML is now loaded. Now comes the interesting part. You want the excerpt to run up to and including the second image of the document, which means everything after that element should be removed. That sounds just like cutting a string, which we know does not work, but this time the DOM parser does all the work for us.

You only need to select all nodes (elements, text, comments, ...) that follow the second <img> tag in document order and delete them. Such a selection can be expressed with XPath:

/descendant::img[position()=2]/following::node()

PHP's DOM parser comes with XPath, so let's do it:

$xp = new DOMXPath($doc);
$delete = $xp->query('/descendant::img[position()=2]/following::node()');
foreach ($delete as $node)
{
    $node->parentNode->removeChild($node);
}

The only thing left is to output the excerpt that remains (echoed here as an example). As we know, it's all inside the <body> tag:

foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child)
{
    echo $doc->saveHTML($child);
}

Which will give you the following:

dolor sit amet, consectetuer adipiscing elit. <img src="http://example.com/img-a.jpg"> Aenean commodo 
ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, 
nascetur ridiculus mus.

<figure><img src="http://example.com/img-b.jpg"></figure>

As this example shows, the <figure> tag is properly closed now.
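
The steps above can be collected into one function. This is only a minimal sketch: the name excerptUntilImage and its $n parameter are made up for illustration, and returning the input unchanged when there are fewer than $n images is an assumption, not something the question specifies. Note also that the parser may add markup of its own (for example a <p> around bare text), so the result is not guaranteed to be byte-identical to the input prefix.

```php
<?php
// Hypothetical helper: cut $html after its $n-th <img> while keeping
// every enclosing element properly closed.
function excerptUntilImage(string $html, int $n = 2): string
{
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML(sprintf('<body>%s</body>', $html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $target = $xp->query(sprintf('/descendant::img[position()=%d]', $n))->item(0);
    if ($target === null) {
        return $html; // assumption: fewer than $n images means nothing to cut
    }

    // Remove every node that follows the target image in document order.
    foreach ($xp->query('following::node()', $target) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize the children of <body>, i.e. the excerpt itself.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling excerptUntilImage($html) with the sample above should produce the same excerpt as the step-by-step version; passing 1 as the second argument would cut after the first image instead.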

A similar scenario is to create an excerpt after a specific text-length or word-count: Wordwrap / Cut Text in HTML string

Oh man, I've got a lot of reading to do until this'll make sense to me. Thanks for the elaborate response — Marius Miliunas, Feb 25 '12 at 00:44

Mr. Llama · Answer 2 · 2012-02-24T21:31:07.357

1

Well, it's not regex, but it should work:

$post = str_ireplace('<img', '!!!<img', $post);
list($p1, $p2) = explode('!!!', $post);
$keep = $p1 . $p2;

Puts a split marker before the image tags (!!!), splits on them and keeps the first two chunks, which should be everything up to the second image tag. No regex required.

Edit: Because this is for an excerpt, you might want to run strip_tags() on the result. If you don't, you may be left with opened HTML tags that never get closed.
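
A slightly more defensive variant of the same idea, as a sketch: the function name keepBeforeSecondImage and the $marker parameter are invented for this example. It limits explode() to three pieces so posts with fewer than two images do not trigger an undefined-offset warning, and it still assumes the marker never occurs in the post itself.

```php
<?php
// Hypothetical helper: keep everything before the second <img> tag.
// Assumes $marker does not appear anywhere in $post.
function keepBeforeSecondImage(string $post, string $marker = '!!!'): string
{
    // Put the marker in front of every image tag (case-insensitive).
    $marked = str_ireplace('<img', $marker . '<img', $post);

    // At most three pieces: before image 1, image 1 up to image 2, the rest.
    $parts = explode($marker, $marked, 3);

    // Keep the first two pieces; with zero or one image this is the whole post.
    return implode('', array_slice($parts, 0, 2));
}
```

As with the snippet above, the result stops just before the second image tag, and passing it through strip_tags() gives a plain-text excerpt.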

edited Feb 24 '12 at 21:31

answered Feb 24 '12 at 19:40

Mr. Llama


I solved my problem doing something similar. Too bad I gotta wait 7 hours to post my answer – Marius Miliunas Feb 24 '12 at 20:22

score 0 · Answer 3 · answered Feb 24 '12 at 19:41


If you really want a regex-based solution, here it is:

// assuming $str is your full HTML text
if ( preg_match_all('~^(.*?<img\s.*?<img\s[^>]*>)~si', $str, $m) )
    print_r ( $m[1] );
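
Since only one match is needed, the same pattern idea can be wrapped with preg_match and a fallback. This is a sketch only: the function name excerptByRegex is hypothetical, and returning the full text when no second image exists is an assumed behaviour. Like any plain string cut, it can leave elements such as <figure> unclosed.

```php
<?php
// Hypothetical helper: return everything up to and including the second <img ...> tag.
function excerptByRegex(string $str): string
{
    // s: dot also matches newlines, i: case-insensitive tag names.
    if (preg_match('~^.*?<img\s[^>]*>.*?<img\s[^>]*>~si', $str, $m) === 1) {
        return $m[0];
    }
    return $str; // assumption: no second image means keep the whole text
}
```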

answered Feb 24 '12 at 19:41

anubhava


What if there's no second image tag? – Mr. Llama Feb 24 '12 at 21:30
@GigaWatt: If you read OP's question you will note: `I want to match everything until after the second image tag` – anubhava Feb 25 '12 at 03:46
that's weird, when I checked on a different tester, it showed absolutely everything, I'll have to look at it again – Marius Miliunas Feb 27 '12 at 15:12
