Elements Inside Opening Tag

Question

I am writing a spider to download all images on the front page of a subreddit using scrapy. To do so, I have to find the image links to download the images from and use a CSS or XPath selector.

Upon inspection, the links are provided but the HTML looks like this for all of them:

```
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px"> <div class="media-preview-content"> <a href="https://i.redd.it/29moua43so501.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/Q-LKAeFelFa9wAdrnvuwCMyXLrs0ULUKMsJTXSf3y34.jpg?w=861&amp;s=69085fb507bed30f1e4228e83e24b6b2" width="861" height="638"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
```

From what I can tell, it looks like all of the new elements are being initialized inside the opening tag of the <div> element. Could you explain what exactly is going on here, and how one would go about extracting image information from this?

*Sorry, I'm not sure how to format the HTML properly, but there isn't much to format, as it is all one big tag anyway.

Well, the HTML is faulty, that's for sure. But I'm not sure if everything following `data-cachedhtml` is supposed to be the value of this attribute (in which case, the `"` quotes inside should be escaped, up until the `

`) or if there's a something missing like `">` right before the ` — Mr Lister, Dec 24 '17 at 13:00

kjhughes · Answer 1 · 2017-12-24T14:48:22.030

How to read the mangled attribute, `data-cachedhtml`

The HTML is a mess. Try the techniques listed in "How to parse invalid (bad / not well-formed) XML?" to get viable markup before using XPath. It may take three passes:

1. Clean up the markup mess.
2. Get the value of the `data-cachedhtml` attribute.
3. Use XPath to extract the image links.
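
The first two passes might be sketched as follows, using only the Python standard library. This assumes the raw page source stores the inner markup entity-escaped (`&lt;`, `&quot;`) inside the attribute, so that the outer tag is actually well-formed and only looks mangled once decoded; the question does not confirm this. The `SAMPLE` string, the `CachedHtmlCollector` class, and the `cached_fragments` helper are hypothetical names for illustration, and genuinely broken markup would still need a real cleanup step first.

```python
from html.parser import HTMLParser

# Hypothetical page source. ASSUMPTION: the inner markup is entity-escaped
# inside the attribute value, which the question does not confirm.
SAMPLE = (
    '<div class="expando expando-uninitialized" style="display: none" '
    'data-cachedhtml=" &lt;div class=&quot;media-preview&quot;&gt; '
    '&lt;a href=&quot;https://i.redd.it/29moua43so501.jpg&quot; '
    'class=&quot;may-blank&quot;&gt; '
    '&lt;img class=&quot;preview&quot; '
    'src=&quot;https://i.redditmedia.com/elided&quot;&gt; '
    '&lt;/a&gt; &lt;/div&gt; ">'
    '<span class="error">loading...</span></div>'
)


class CachedHtmlCollector(HTMLParser):
    """Collect the decoded value of every data-cachedhtml attribute."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names and unescapes entity
        # references in attribute values before calling this method.
        for name, value in attrs:
            if name == "data-cachedhtml" and value:
                self.fragments.append(value.strip())


def cached_fragments(page_source):
    """Return the inner HTML fragments stored in data-cachedhtml attributes."""
    collector = CachedHtmlCollector()
    collector.feed(page_source)
    collector.close()
    return collector.fragments


fragments = cached_fragments(SAMPLE)
print(fragments[0])
```

Each returned fragment is ordinary HTML that can then be handed to whatever parser performs the XPath step.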

XPath part

For the de-mangled `data-cachedhtml` in this form:

```
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
  <div class="media-preview-content">
    <a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
      <img class="preview" src="https://i.redditmedia.com/elided"
           width="861" height="638"/>
    </a>
  </div>
  <span class="error">loading...</span>
</div>
```

This XPath will retrieve the preview image links:
```
//a/img/@src
```
(That is, all `src` attributes of `img` elements that are children of `a` elements.)

or

This XPath will retrieve the click-through image links:
```
//a[img]/@href
```
(That is, all `href` attributes of `a` elements that have an `img` child.)
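
Both expressions can be tried against the de-mangled fragment with the standard library's `xml.etree.ElementTree`. This is only a sketch: it works here because that fragment happens to be well-formed XML (note the self-closing `img`), and ElementTree implements a limited XPath subset without attribute steps such as `/@src`, so the code selects the elements and reads the attributes in Python instead. `FRAGMENT` copies the markup shown earlier, including its elided `src`.

```python
import xml.etree.ElementTree as ET

# The de-mangled markup from above, copied as-is (the src stays elided).
FRAGMENT = """
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
  <div class="media-preview-content">
    <a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
      <img class="preview" src="https://i.redditmedia.com/elided"
           width="861" height="638"/>
    </a>
  </div>
  <span class="error">loading...</span>
</div>
"""

root = ET.fromstring(FRAGMENT.strip())

# Counterpart of //a/img/@src: img children of a elements, then their src.
preview_links = [img.get("src") for img in root.findall(".//a/img")]

# Counterpart of //a[img]/@href: a elements with an img child, then their href.
click_through_links = [a.get("href") for a in root.findall(".//a[img]")]

print(preview_links)
print(click_through_links)
```

Inside a Scrapy spider the XPath strings could presumably be used unchanged, for example by wrapping the fragment in `scrapy.Selector(text=fragment)` and calling `.xpath('//a[img]/@href').getall()` on it, since Scrapy's selectors support full XPath 1.0 and tolerate HTML that is not well-formed XML.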
