7

Demo

I need to get the image src from the following code:

HTML

<div class="avatar profile_CF48B2B4A31B43EC96F0561F498CE6BF ">
    <a onclick="">
        <img id="lazyload_-247847544_0" height="74" width="74" class="avatar potentialFacebookAvatar avatarGUID:CF48B2B4A31B43EC96F0561F498CE6BF" src="http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg" />
    </a>
</div>

I tried the following PHP (using Simple HTML DOM):

foreach($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF] a img') as $element) {
    $img = $element->getAttribute('src');
    echo $img;
}

But it reports that the src key doesn't exist. How can I scrape the review avatar images?

UPDATE:

The image URL is not in the page source, but Firebug shows the image URL:

<img id='lazyload_1953171323_17' height='24' alt='4 helpful votes' width='25' class='icon lazy'/>

Here is my page's source code:

<div class="col1of2">
<div class="member_info">
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-SRC_175428572" class="memberOverlayLink" onmouseover="ta.trackEventOnPage('Reviews','show_reviewer_info_window','user_name_photo'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', 0, (new Element(this)).getElement('.avatar')&&(new Element(this)).getElement('.avatar').getStyle('border-radius')=='100%'?-10:0);">
<div class="avatar profile_3E0FAF58557D3375508A9E5D9A7BD42F ">
<a onclick=>
<img id='lazyload_1953171323_15' height='74' width='74' class='avatar potentialFacebookAvatar avatarGUID:3E0FAF58557D3375508A9E5D9A7BD42F'/>
</a>
</div>
<div class="username mo">
<span class="expand_inline scrname hvrIE6 mbrName_3E0FAF58557D3375508A9E5D9A7BD42F" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">Prataspeles</span>
</div>
</div>
<div class="location">
Latvia
</div>
</div>
<div class="memberBadging">
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-CONT" class="totalReviewBadge badge no_cpu" onclick="ta.trackEventOnPage('Reviews','show_reviewer_info_window','review_count'); ta.util.cookie.setPIDCookie('15984'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', -10, -50);">
<div class="reviewerTitle">Reviewer</div>
<img id='lazyload_1953171323_16' height='24' alt='4 reviews' width='25' class='icon lazy'/>
<span class="badgeText">4 reviews</span>
</div>
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-HV" class="helpfulVotesBadge badge no_cpu" onclick="ta.trackEventOnPage('Reviews','show_reviewer_info_window','helpful_count'); ta.util.cookie.setPIDCookie('15983'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', -22, -50);">
<img id='lazyload_1953171323_17' height='24' alt='4 helpful votes' width='25' class='icon lazy'/>
<span class="badgeText">4 helpful votes</span>
</div>
</div>
</div> 

Is the problem caused by lazy loading?

UPDATE 2

Lazy loading means the images are only loaded after the page itself has loaded. I tried getting the image ids and comparing them with the lazyload JS array, but the ids don't match the entries in the lazyload array.

Question:

How can I extract this JS array from the page and parse it as JSON?

Example:

{"id":"lazyload_-205858383_0","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg"}
,   {"id":"lazyload_-205858383_1","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
,   {"id":"lazyload_-205858383_2","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/01/2a/fd/98/avatar.jpg"}
,   {"id":"lazyload_-205858383_3","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
,   {"id":"lazyload_-205858383_4","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/5e/avatar036.jpg"}
,   {"id":"lazyload_-205858383_5","tagType":"img","scroll":false,"priority":100,"data":"http://c1.tacdn.com/img2/badges/badge_helpful.png"}
Aravind
Kārlis Millers
  • You are having difficulty because JavaScript is used to lazy-load the image after the page has loaded. Use a PHP DOM parser to find the id of the element, and then use a regular expression to find the relevant image based on this id. – Kami Jul 04 '14 at 09:29
  • @Kami bt how to parse javascript? – Kārlis Millers Jul 04 '14 at 10:39

5 Answers

4

You are having difficulty because JavaScript is used to lazy-load the image after the page has loaded. Use a PHP DOM parser to find the id of the element, and then use a regular expression to find the relevant image based on this id.

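To achieve this, try something like:

// Decode into associative arrays (second argument true) so $lazy["id"] works.
$json = json_decode("<JSONSTRING HERE>", true);

foreach ($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF] a img') as $element) {
    $imgId = $element->getAttribute('id');

    // Look for the lazy-load entry that belongs to this image.
    foreach ($json as $lazy) {
        if ($lazy["id"] == $imgId) {
            echo $lazy["data"];
        }
    }
}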

The above is untested, so you will need to work out the kinks. The key is to extract the relevant JavaScript and convert it to JSON.

Alternatively, you can use string search functions to get the row which contains the information about the img, and extract the required value.
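The string-search alternative could look roughly like the sketch below. It assumes the page embeds each lazy-load entry as a flat JSON object like the ones shown in the question. The helper name findLazyImageUrl and the $source sample string are made up for illustration and are not part of any library.

```php
<?php
// Hypothetical helper: return the "data" URL of the lazy-load entry for one image id.
// Assumes each entry looks like {"id":"...", ..., "data":"..."} with no nested braces.
function findLazyImageUrl($source, $imgId)
{
    // Match the single JSON object whose "id" equals $imgId.
    $pattern = '/\{"id":"' . preg_quote($imgId, '/') . '"[^{}]*\}/';
    if (!preg_match($pattern, $source, $match)) {
        return null; // no entry for this id in the page source
    }

    // Decode just that one object and read its "data" field.
    $entry = json_decode($match[0], true);
    return isset($entry['data']) ? $entry['data'] : null;
}

// Sample input built from the JSON shown in the question (not the live page).
$source = 'var lazyImgs = ['
    . '{"id":"lazyload_-205858383_0","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg"}'
    . ',   {"id":"lazyload_-205858383_1","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}'
    . '];';

echo findLazyImageUrl($source, 'lazyload_-205858383_1'), "\n";
```

Matching one entry at a time avoids decoding the whole array, but it breaks if an entry ever contains nested braces, so decoding the full array is probably the safer route when the page format allows it.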

Kami
  • @KārlisMillers I do not have access to PHP at the moment to give a more concrete working example, but you can use a PHP DOM parser to search for script tags and extract their content, or search for the id string in the original HTML (there should only be two matches, one for the control and one for the lazyload entry), or use a regular expression to extract the JSON array and then use the above pseudocode. – Kami Jul 04 '14 at 11:21
  • Thanks for the idea. My final version is in my answer. – Kārlis Millers Jul 18 '14 at 09:10
3

If you're looking for all IDs that contain the substring, "lazyload", you might try the wildcard selector and upon a hit look at the 'src' property of the element found. See the jsfiddle below. Good luck!

$(document.body).find('img[id*=lazyload]').each(function() {
   console.log($(this).prop('src'));
});

Jsfiddle

J. LaRosee
  • Can you please add some explanation? Code-only answers are (sometimes) good, but code + explanation is (most times) better – Barranka Jul 10 '14 at 18:55
1

Try this:

foreach($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF ] a img') as $element) {
    $img = $element->getAttribute('src');
    echo $img;
}

There is a space after the class name in the HTML, so you have to add a space at the end of the class name in the selector.

OR

use the full class name:

$html->find('div[class=avatar profile_CF48B2B4A31B43EC96F0561F498CE6BF ] a img');

TBI
1

Use a jQuery selector, i.e. $('#lazyload_-247847544_0'), and read the image source from it like this:

var src = $('#lazyload_-247847544_0').attr('src');

Or more specifically

$('.profile_CF48B2B4A31B43EC96F0561F498CE6BF #lazyload_-247847544_0').attr('src');

Thanks

Soner Gönül
0
// Requires the Simple HTML DOM library (simple_html_dom.php) to be loaded.
function getReviews() {

    $url = 'http://www.tripadvisor.com/Hotel_Review-g274965-d952833-Reviews-Ezera_Maja-Liepaja_Kurzeme_Region.html';
    $html = file_get_html($url);
    $array = array();

    // IMG ID: collect the id of every avatar image.
    foreach ($html->find('div[class=avatar] a img') as $element) {
        $array[] = array('id' => $element->getAttribute('id'));
    }

    // IMG SRC: cut the "var lazyImgs = [...]" array out of the page source and decode it.
    $source = $html->save();
    $p1 = strpos($source, 'var lazyImgs =') + 14;
    $p2 = strpos($source, ']', $p1);
    $raw = substr($source, $p1, $p2 - $p1) . ']';
    $images = json_decode($raw);
    if (!is_array($images)) {
        $images = array(); // the array was not found or is not valid JSON
    }

    // Attach the real image URL to the entry with the matching id.
    foreach ($images as $image) {
        foreach ($array as $key => $element) {
            if ($element['id'] == $image->id) {
                $array[$key]['image'] = $image->data;
            }
        }
    }

    $html->clear();
    unset($html);
    return $array;
}

First collect the image ids in an array. Then cut the lazyImgs variable out of the page source and decode it as JSON. Finally, compare the two arrays and, where the ids match, add the image URL to the array. Thanks to everybody!
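The same steps can be tried without network access using the sketch below. It is a minimal illustration, not the code from this answer: it swaps Simple HTML DOM for PHP's built-in DOMDocument and DOMXPath, and it runs on a small sample page fragment abbreviated from the markup in the question. The variable names ($source, $urlById, $avatars) are assumptions made for the example.

```php
<?php
// Sample fragment: an avatar <img> without src, plus the lazy-load array.
// Abbreviated from the question; this is not the live page.
$source = <<<'HTML'
<div class="avatar profile_CF48B2B4A31B43EC96F0561F498CE6BF ">
<a><img id="lazyload_-205858383_0" height="74" width="74" class="avatar"/></a>
</div>
<script>
var lazyImgs = [
{"id":"lazyload_-205858383_0","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg"}
,   {"id":"lazyload_-205858383_1","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
];
</script>
HTML;

// Step 1: build an id => URL map from the lazy-load array.
// Like the strpos version, this stops at the first "]" after "var lazyImgs =".
$urlById = array();
if (preg_match('/var lazyImgs\s*=\s*(\[.*?\])/s', $source, $m)) {
    $decoded = json_decode($m[1], true);
    if (is_array($decoded)) {
        foreach ($decoded as $entry) {
            $urlById[$entry['id']] = $entry['data'];
        }
    }
}

// Step 2: collect the avatar images and look up each id in the map.
libxml_use_internal_errors(true); // tolerate real-world, non-valid HTML
$doc = new DOMDocument();
$doc->loadHTML($source);
$xpath = new DOMXPath($doc);

$avatars = array();
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " avatar ")]//img';
foreach ($xpath->query($query) as $img) {
    $id = $img->getAttribute('id');
    $avatars[] = array(
        'id' => $id,
        'image' => isset($urlById[$id]) ? $urlById[$id] : null,
    );
}

print_r($avatars);
```

Keying the decoded entries by id replaces the nested loop with a single lookup per image, and an image with no lazy-load entry simply gets a null URL instead of shifting the other results.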

Kārlis Millers