I am trying to get the source for images on some pages but there are some differences between the code of the two pages.

Page 1 Code:

<img class="thumb thumb_0" onclick="setImage(0); return false;" src="http://example.com/b1.jpg">

Page 2 Code:

<img style="width: 46px ! important; height: 46px ! important;" class="thumb thumb_0" onclick="setImage(0); return false;" src="http://example.com/image4.jpg">

Notice the difference between the two pages: Page 2 has an inline style attribute at the beginning of the img tag, so class and onclick are no longer the first attributes. The only thing I need to grab is the image location (the src value).

Here is the code that I have thus far... which only works for page 1 scenario:

preg_match_all("/<img\s*?class='thumb.*?'.*?src='(.*?)'.*?\/>/is", $hotelPage, $thumbs, PREG_PATTERN_ORDER);

Ideally, I would like to keep it to one line of PHP. How can I do an "or" in the pattern passed to preg_match_all, and how can I get the regex to work for page 2 as well?

Thank you in advance!

UPDATE: The pages have other images; I am only looking for the ones whose class contains "thumb". I apologize for leaving out that important detail.

NotJay
    And once again, I can't stress enough how things can go wrong when using regexes to parse HTML content. In this case a regex could fail solely because the tag attributes appear in a different order than the pattern expects. Use a DOM parser! No excuse for not doing it. –  Jul 11 '12 at 20:13

4 Answers

There are multiple regex examples around the net regarding HTML attributes. One that should work for your two specific cases, as well as just about any other image-src would be:

preg_match_all("/<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>/", $hotelPage, $thumbs);

Details regarding this specific regex can be found here: Regular expression to get an attribute from HTML tag

A modified version that also requires a class attribute starting with "thumb" (and appearing before src in the tag) would be:

preg_match_all("/<img[^>]+class=\"thumb[^\"]*\"[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>/", $hotelPage, $thumbs);
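The pattern above expects class to appear before src. As a minimal sketch, assuming attribute values never contain a literal ">", a lookahead can test for the class without consuming input, so the attribute order no longer matters. The sample markup (including the logo image) is made up for illustration.

```php
<?php
// Hypothetical sample markup: the two tags from the question plus one
// non-thumb image whose src comes before its class.
$hotelPage = '<img class="thumb thumb_0" onclick="setImage(0); return false;" src="http://example.com/b1.jpg">'
    . '<img src="http://example.com/logo.jpg" class="logo">'
    . '<img style="width: 46px ! important; height: 46px ! important;" class="thumb thumb_0" onclick="setImage(0); return false;" src="http://example.com/image4.jpg">';

// The lookahead (?=...) checks for class="thumb..." anywhere inside the tag
// without consuming input, so src may come before or after class.
$pattern = '/<img\b(?=[^>]*\bclass\s*=\s*["\']thumb)[^>]*\bsrc\s*=\s*["\']([^"\']+)["\'][^>]*>/i';

preg_match_all($pattern, $hotelPage, $thumbs);

// $thumbs[1] holds the captured src values.
print_r($thumbs[1]);
```

This is still a regex over HTML, so it inherits the usual fragility; it only removes the dependency on attribute order.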
newfurniturey

This should work as you intended. If your HTML is in $html, the regular expression should look like $reg:

$html='some html <img class="thumb thumb_0" onclick="setImage(0); return false;"
   src="http://example.com/b1.jpg"> xxx yyy <img class="bummer thumb_0"
   onclick="setImage(0); return false;" src="http://example.com/bummer.jpg">
   <img style="width: 46px ! important; height: 46px ! important;"
   class="thumb thumb_0" onclick="setImage(0); return false;"
   src="http://example.com/image4.jpg"> some html';

$reg = ' <img [^>]+?             # img tag; [^>] keeps the match inside one tag
         class="thumb [^>]+?     # class attribute starting with "thumb"
         src="([^"]+)            # capture src
       ';

preg_match_all("/$reg/xis", $html, $thumbs, PREG_SET_ORDER);

foreach($thumbs as $t) echo $t[1]."\n";

It matches only if class comes before src in the tag, and only for img tags whose class starts with "thumb". The output:

http://example.com/b1.jpg
http://example.com/image4.jpg

Only two of three img entries match (I included a third, wrong link in the test set).
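The question used PREG_PATTERN_ORDER while the code above uses PREG_SET_ORDER. As a small sketch of the difference, the snippet below runs one pattern with both flags; the shortened markup and the simplified pattern are assumptions made for illustration.

```php
<?php
// Hypothetical, shortened markup with two thumb images.
$html = '<img class="thumb a" src="http://example.com/b1.jpg"> '
      . '<img class="thumb b" src="http://example.com/image4.jpg">';

// Simplified pattern: class must precede src within a single tag.
$reg = '/<img[^>]+?class="thumb[^>]+?src="([^"]+)/is';

preg_match_all($reg, $html, $byPattern, PREG_PATTERN_ORDER);
preg_match_all($reg, $html, $bySet, PREG_SET_ORDER);

// PREG_PATTERN_ORDER: $byPattern[1] is a flat list of every captured src.
// PREG_SET_ORDER: $bySet[$i][1] is the captured src of the i-th match.
$fromSets = array_column($bySet, 1);

print_r($byPattern[1]);
print_r($fromSets);
```

Both arrays hold the same URLs; the flag only changes how the matches are grouped.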

Regards

rbo

rubber boots
  • Thank you, this is along the lines of what I was looking for. After some modifications, it worked pretty well with my script. – NotJay Jul 12 '12 at 18:55

If all you want is the src, then you should just ignore everything else in your regex.

Try:

/<img\s[^>]*src=["']([^"']*)["'][^>]*>/iu

as your regex.

Palladium
  • I am sorry, I forgot to mention that the page has other images... I am looking for the ones that have a class that starts with thumb – NotJay Jul 11 '12 at 20:11

Using regular expressions to parse XML/HTML is not recommended. You should see this question: RegEx match open tags except XHTML self-contained tags

What you can do instead is use something like DOMDocument to extract the URLs:

$html = '<img class="thumb thumb_0" onclick="setImage(0); return false;" src="http://example.com/b1.jpg">
<img style="width: 46px ! important; height: 46px ! important;" class="thumb thumb_0" onclick="setImage(0); return false;" src="http://example.com/image4.jpg">';

$dom = new DOMDocument();
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');

$image_urls = array();
foreach ($images as $image) {

    // only match images with class thumb
    if (strpos(' ' . $image->getAttribute('class') . ' ', ' thumb ') !== false) {
        $image_urls[] = $image->getAttribute('src');
    }
}

var_dump($image_urls);
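The same class filter can also be written as an XPath query with DOMXPath, which moves the "class contains the token thumb" test out of the loop. This is a sketch under assumptions: the sample markup is shortened, and the "thumbnail" image is a made-up entry included only to show that it is not matched.

```php
<?php
// Hypothetical, shortened markup; the "thumbnail" image should be skipped.
$html = '<img class="thumb thumb_0" src="http://example.com/b1.jpg">'
      . '<img class="thumbnail" src="http://example.com/other.jpg">'
      . '<img style="width: 46px ! important;" class="thumb thumb_0" src="http://example.com/image4.jpg">';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Padding the class value with spaces makes ' thumb ' match only as a
// whole class token, not as a prefix of another class name.
$xpath = new DOMXPath($dom);
$query = "//img[contains(concat(' ', normalize-space(@class), ' '), ' thumb ')]/@src";

$image_urls = array();
foreach ($xpath->query($query) as $src) {
    $image_urls[] = $src->nodeValue;
}

var_dump($image_urls);
```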
Craig