How do I extract img src from HTML via lxml XPath?

Question

I'm trying to extract an image URL using Python/lxml and the xpath() method, but I'm having trouble isolating the URL itself.

Here is the HTML surrounding the img src that I want:

<div data-index="0" data-za-action="Photo Lightbox - Open"
     data-za-category="Homes"
     class="img-wrapper za-track-event zsg-lightbox-show"
     data-target-id="hdp-photo-lightbox"
     data-za-label="position: 0, total: 18, id: 10660534745"
     id="yui_3_18_1_2_1519884476676_1986">
  <img src="https://photos.zillowstatic.com/p_h/IS2fordnekys6d1000000000.jpg"
       onload="if (typeof ClientProfiler !== 'undefined') { ClientProfiler.profile('HDPFirstPhotoLoaded') }"
       id="X1-IAgz3dcnekys6d1000000000_ptw8e" class="hip-photo">
</div>

Specifically, I want to isolate the URL https://photos.zillowstatic.com/p_h/IS2fordnekys6d1000000000.jpg.

I've tried a few approaches without success, including variations on the following:

xpath(".//img[@class='hip-photo']/@src")
xpath(".//img[@class='hip-photo']//text()")

Welcome to SO! Please take the [tour], and read [ask] and [MCVE]! Also, screenshots attract downvotes like magnets. — gsquaredxc, Mar 02 '18 at 02:02
Please include code as text in `code` sections and not as images. — zx485, Mar 02 '18 at 02:03
Are you sure the content of that page is not generated dynamically? If it is static, fine; otherwise you should use a browser simulator to render the page first. After that, the suggested XPaths, or even the ones you already wrote, should work. – SIM, Mar 02 '18 at 08:23

score 1 · Answer 1 · answered Mar 02 '18 at 02:13

I would try the BeautifulSoup (bs4) library. Your img tag has an id, so you could call the find function in bs4.

source_code.find('img', id=its_id)

Then get the src from the tag.
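
As a rough sketch of the two steps above, assuming the img markup from the question (trimmed here to a single line) and Python's bundled "html.parser" backend. The names source_code and its_id follow the snippet above; html_text, img and src are only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the line breaks
# inside the attributes in the original post were paste artifacts).
html_text = (
    '<div class="img-wrapper za-track-event zsg-lightbox-show">'
    '<img src="https://photos.zillowstatic.com/p_h/IS2fordnekys6d1000000000.jpg" '
    'id="X1-IAgz3dcnekys6d1000000000_ptw8e" class="hip-photo">'
    '</div>'
)

# "html.parser" ships with Python, so no extra parser has to be installed.
source_code = BeautifulSoup(html_text, "html.parser")

# Look the tag up by its id; find() returns None when nothing matches.
its_id = "X1-IAgz3dcnekys6d1000000000_ptw8e"
img = source_code.find("img", id=its_id)

# Tag.get() returns None instead of raising if the attribute is missing.
src = img.get("src") if img is not None else None
print(src)
```

If the id changes from page to page, the class can be used for the lookup instead, for example source_code.find("img", class_="hip-photo").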

Similar question regarding your problem

bs4 Youtube tutorial if you're new to it

BeautifulSoup is extremely easy to learn, so if you have never used it before, I would recommend looking into it.

Hope this helps!

kjhughes · Answer 2 · 2018-03-02T13:38:37.847

1

.// searches relative to the current node, which is unspecified in your question. If you use // it'll search the entire document. See also What is the difference between .// and //* in XPath?
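
To illustrate the difference, here is a small sketch using a made-up two-div document (the ids "a" and "b" and the file names are hypothetical, not from the question). The same img expression is evaluated from one div, once with .// and once with //:

```python
from lxml import html

# Hypothetical document with two images in separate containers.
doc = html.fromstring("""
<html><body>
  <div id="a"><img class="hip-photo" src="a.jpg"></div>
  <div id="b"><img class="hip-photo" src="b.jpg"></div>
</body></html>
""")

# Use the first div as the context node.
first_div = doc.xpath('//div[@id="a"]')[0]

# .// only looks at descendants of the context node.
relative = first_div.xpath('.//img/@src')

# // starts from the document root, even when called on first_div.
absolute = first_div.xpath('//img/@src')

print(relative)
print(absolute)
```

Here relative should contain only the image inside the first div, while absolute should contain both images.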

If you wish to search the entire document, this XPath,

//img[@class="hip-photo"]/@src

will select all the src attributes of all img elements with a class attribute value of "hip-photo".
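
A minimal sketch of running that expression with lxml, assuming the markup from the question has already been fetched into a string (trimmed here; the names page, tree, srcs and first_src are only illustrative):

```python
from lxml import html

# Trimmed copy of the markup from the question.
page = (
    '<div class="img-wrapper za-track-event zsg-lightbox-show">'
    '<img src="https://photos.zillowstatic.com/p_h/IS2fordnekys6d1000000000.jpg" '
    'id="X1-IAgz3dcnekys6d1000000000_ptw8e" class="hip-photo">'
    '</div>'
)

tree = html.fromstring(page)

# Selecting an attribute makes xpath() return a list of strings, one per match.
srcs = tree.xpath('//img[@class="hip-photo"]/@src')

# Guard against an empty list, which is what you get when the image is not
# present in the HTML that was actually downloaded.
first_src = srcs[0] if srcs else None
print(first_src)
```

Note that @class="hip-photo" compares the whole attribute value, so it would not match an img whose class attribute lists several classes; contains(concat(' ', normalize-space(@class), ' '), ' hip-photo ') is a common workaround for that case.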

edited Mar 02 '18 at 13:38

answered Mar 02 '18 at 02:29

Comment taken out. Gave a plus one for the link and the clarity about using the dot. – SIM Mar 02 '18 at 15:32
