I am writing a Scrapy spider to download all images on the front page of a subreddit. To do so, I need to locate the image links with a CSS or XPath selector so that I can download the images from them.

Upon inspection, the links are provided but the HTML looks like this for all of them:

<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px"> <div class="media-preview-content"> <a href="https://i.redd.it/29moua43so501.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/Q-LKAeFelFa9wAdrnvuwCMyXLrs0ULUKMsJTXSf3y34.jpg?w=861&amp;s=69085fb507bed30f1e4228e83e24b6b2" width="861" height="638"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>

From what I can tell, it looks like all of the new elements are being initialized inside the opening tag of the <div> element. Could you explain what exactly is going on here, and how one would go about extracting image information from this?

*Sorry, I'm not quite sure how to format the HTML properly, but there really isn't much to format, as it is all one big tag anyway.

Mr Lister
  • Well, the HTML is faulty, that's for sure. But I'm not sure if everything following `data-cachedhtml` is supposed to be the value of this attribute (in which case, the `"` quotes inside should be escaped, up until the `
    `) or if there's something missing like `">` right before the `
    – Mr Lister Dec 24 '17 at 13:00

1 Answer

How to read the mangled attribute, data-cachedhtml

The HTML is a mess. Try the techniques listed in "How to parse invalid (bad / not well-formed) XML?" to get viable markup before using XPath. It may take three passes:

  1. Clean up the markup mess.
  2. Get the attribute value of data-cachedhtml.
  3. Use XPath to extract the image links.
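The three passes above can be sketched with nothing but the Python standard library. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the attribute value always ends at the `" data-pin-condition=` boundary seen in the question's sample, and it swaps in `html.parser` for an XPath engine in the last pass. The names `MANGLED`, `LinkCollector`, and `extract_links` are hypothetical, and the sample string is a shortened copy of the question's markup.

```python
import re
from html import unescape
from html.parser import HTMLParser

# Shortened sample modeled on the markup in the question.
MANGLED = (
    '<div class="expando expando-uninitialized" style="display: none" '
    'data-cachedhtml=" <div class="media-preview"> '
    '<a href="https://i.redd.it/29moua43so501.jpg" class="may-blank"> '
    '<img class="preview" src="https://i.redditmedia.com/elided"> </a> </div> " '
    'data-pin-condition="function() {return this.style.display != \'none\';}">'
    '<span class="error">loading...</span></div>'
)

# Assumption: the value of data-cachedhtml runs up to the quote that
# immediately precedes the data-pin-condition attribute.
CACHED_HTML = re.compile(r'data-cachedhtml="(.*?)"\s+data-pin-condition=', re.DOTALL)


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.hrefs.append(attributes["href"])
        elif tag == "img" and "src" in attributes:
            self.srcs.append(attributes["src"])


def extract_links(page_source):
    """Return (click-through links, preview links) found in page_source."""
    hrefs, srcs = [], []
    # Passes 1 and 2: cut the attribute value out of the mangled markup.
    for match in CACHED_HTML.finditer(page_source):
        # unescape() covers the case where the raw source entity-escapes
        # the value (an assumption about the real page).
        fragment = unescape(match.group(1))
        # Pass 3: walk the recovered fragment and collect the links.
        collector = LinkCollector()
        collector.feed(fragment)
        collector.close()
        hrefs.extend(collector.hrefs)
        srcs.extend(collector.srcs)
    return hrefs, srcs


hrefs, srcs = extract_links(MANGLED)
```

If the attribute boundary differs on the real page, only the `CACHED_HTML` pattern should need adjusting; the recovered fragment could equally be handed to an XPath-capable parser for the last pass.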

XPath part

For the de-mangled data-cachedhtml in this form:

<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
  <div class="media-preview-content">
    <a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
      <img class="preview" src="https://i.redditmedia.com/elided"
           width="861" height="638"/>
    </a>
  </div>
  <span class="error">loading...</span>
</div>
  1. This XPath will retrieve the preview image links:

    //a/img/@src
    

    (That is, all src attributes of img element children of a elements.)

or

  1. This XPath will retrieve the click-through image links:

    //a[img]/@href
    

    (That is, all href attributes of the a elements that have an img child.)
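Both expressions can be tried against the de-mangled markup from Python. The sketch below is illustrative only: it uses the standard library's `xml.etree.ElementTree`, which supports just a subset of XPath, so the `@src` and `@href` steps are done with `.get()` instead of in the path itself. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The de-mangled fragment from above; it is well-formed, so an XML parser accepts it.
DEMANGLED = """\
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
  <div class="media-preview-content">
    <a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
      <img class="preview" src="https://i.redditmedia.com/elided"
           width="861" height="638"/>
    </a>
  </div>
  <span class="error">loading...</span>
</div>
"""

root = ET.fromstring(DEMANGLED)

# Counterpart of //a/img/@src: img children of a elements, then their src.
preview_links = [img.get("src") for img in root.findall(".//a/img")]

# Counterpart of //a[img]/@href: a elements with an img child, then their href.
click_through_links = [a.get("href") for a in root.findall(".//a[img]")]

print(preview_links)
print(click_through_links)
```

With a full XPath engine, such as Scrapy's `Selector(text=fragment).xpath(...)`, the two expressions should be usable exactly as written above, including the attribute step.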


kjhughes