Get all images and all except images with regex

Question

I have an article containing text and multiple images, and I need to extract the images and the text separately.

Currently I have this code, but it only returns the last image in the article:

preg_match('/<img.+src=[\'"](?P<src>.+?)[\'"].*>/i', $article, $img);

How can I select all the images, and do the reverse to get just the text?

Thank you

Don't use regex for this, use a DOM parser. http://stackoverflow.com/a/1732454/362536 — Brad, Apr 30 '14 at 22:46
possible duplicate of [PHP preg\_match to find multiple occurrences](http://stackoverflow.com/questions/2029976/php-preg-match-to-find-multiple-occurrences) — Anonymous, Apr 30 '14 at 22:46
python+beautifulsoup? soup.find_all('img')..['src']? soup.text? I can provide more detail if you give me some sample data, and I can write some POC code — B.Mr.W., Apr 30 '14 at 22:59

score 1 · Answer 1 · answered Apr 30 '14 at 22:47

$text = preg_replace('/<img.+src=[\'"](?P<src>.+?)[\'"].*>/i', '', $article);
preg_match_all('/<img.+src=[\'"](?P<src>.+?)[\'"].*>/i', $article, $images);

//use $images and $text

Sean Johnson

In HTML parsing, you almost never want to use a greedy match `.+`. Use an ungreedy one `.+?` – HamZa Apr 30 '14 at 22:53
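
To illustrate the comment above, here is a minimal sketch comparing the greedy pattern from the question with a tighter variant. The sample `$article` string, the variable names `$greedy` and `$lazy`, and the tighter pattern itself are assumptions made for this illustration; they are not taken from either answer.

```php
<?php
// Hypothetical sample input: two images on a single line.
$article = '<p>Intro <img src="a.png" alt="A"> middle <img src="b.png" alt="B"> end</p>';

// Pattern from the question: the greedy ".+" runs on to the last "src=" on the line,
// so the whole line becomes one match and only the last image is captured.
$greedy = '/<img.+src=[\'"](?P<src>.+?)[\'"].*>/i';
preg_match_all($greedy, $article, $greedyMatches);

// Tighter variant (an assumption, not the answer's code):
// [^>] keeps each match inside a single tag, and \1 requires the same closing quote.
$lazy = '/<img\b[^>]*?\bsrc=([\'"])(?P<src>.*?)\1[^>]*>/i';
preg_match_all($lazy, $article, $lazyMatches);

// Removing the matched tags leaves the remaining markup and text.
$text = preg_replace($lazy, '', $article);

print_r($greedyMatches['src']);
print_r($lazyMatches['src']);
```

With this sample, the greedy pattern yields a single match whose `src` is the last image, while the tighter variant yields one match per image. The tighter pattern would still fail on a `>` inside an attribute value, which is one reason the comments recommend a DOM parser.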

Casimir et Hippolyte · Accepted Answer · 2014-05-01T00:47:06.450

You can use the DOM for that:

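$imgSrc = array();
$txt = '';

$dom = new DOMDocument();
// @ suppresses the warnings libxml emits for imperfect HTML
@$dom->loadHTML($article);

// Collect the src attribute of every <img> element
$imgs = $dom->getElementsByTagName('img');

foreach ($imgs as $img) {
    $imgSrc[] = $img->getAttribute('src');
}

// Collect every text node that is not directly inside <script> or <style>
$xpath = new DOMXPath($dom);
$textNodes = $xpath->query('//*[not(self::script) and not(self::style)]/text()');
foreach ($textNodes as $textNode) {
    $tmp = trim($textNode->textContent);
    // Compare against '' rather than using empty(), which would also drop the text "0"
    $txt .= ($tmp === '') ? '' : $tmp . PHP_EOL;
}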

XPath query details:

`//` means anywhere in the DOM tree
`*` means any element node
`[...]` defines a condition
`not(self::script)`: the current node must not be a script element
`text()` selects the text nodes that are direct children of the current node
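
As a self-contained way to try the query out, here is a sketch that wraps the same approach in a function and runs it on a small sample. The function name `splitArticle()`, the sample `$article` string, and the use of `libxml_use_internal_errors()` in place of `@` are assumptions made for this illustration.

```php
<?php
/**
 * Split an HTML fragment into image sources and visible text.
 * splitArticle() is a hypothetical helper name used only for this sketch.
 */
function splitArticle(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Every <img> element, wherever it appears in the tree.
    $images = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $images[] = $img->getAttribute('src');
    }

    // Text nodes whose parent element is neither <script> nor <style>.
    $xpath = new DOMXPath($dom);
    $lines = [];
    foreach ($xpath->query('//*[not(self::script) and not(self::style)]/text()') as $node) {
        $piece = trim($node->textContent);
        if ($piece !== '') {
            $lines[] = $piece;
        }
    }

    return ['images' => $images, 'text' => implode(PHP_EOL, $lines)];
}

// Hypothetical sample article with two images and a script block.
$article = '<p>Hello <img src="a.png"> world</p><script>var x = 1;</script><p>Bye<img src="b.jpg"></p>';
$result = splitArticle($article);

print_r($result['images']);
echo $result['text'], PHP_EOL;
```

For this sample, `$result['images']` should hold both `src` values and `$result['text']` should contain the paragraph text without the script body, one trimmed text node per line.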

edited May 01 '14 at 00:47

answered Apr 30 '14 at 23:14

Casimir et Hippolyte

Thanks, $imgSrc is working now but $txt is not. I am getting $article from a MySQL select, and it is processed before it is displayed on the page; it is not data that has already been displayed. Is that a problem, or is it something else? – Jakob Apr 30 '14 at 23:27
just second closing bracket in $textNodes is missing :) – Jakob May 01 '14 at 00:03
@Yesian_: indeed, only one suffices. – Casimir et Hippolyte May 01 '14 at 00:47
