Extracting .jpg from xpath

Question

I am trying to extract links of .jpg images from the following link: https://asheville.craigslist.org/search/sss

If you look at the page in the browser's inspector, there are <img> nodes nested in the <a> nodes, and they hold the links I need to extract.

I am new to Scrapy and XPath, and I can't get anything to return other than an empty list.

I've tried many variations of this code without any luck:

response.xpath('//*[@id="sortable-results"]/ul/li/a/img/')

See above for the code I've been trying. Thanks! – Keenan Burke-Pitts Apr 20 '17 at 14:18

score 0 · Answer 1 · answered Apr 20 '17 at 06:08

It seems the data is hidden in the data-ids attribute of the <a> nodes and later unpacked by JavaScript into a gallery of images.

<a href="/cto/6095960745.html" class="result-image gallery"
data-ids="1:01414_7WJQELsYuex,1:00t0t_kxF99J8uXmP,1:00S0S_dgnLA6FvDKX,1:00404_kTP1mB2Flpb,1:00P0P_j5On1SCHLuP,1:00a0a_jZYNazvdTgo,1:00Y0Y_9HJf6UJJVg7,1:00p0p_loCrLMXpS5s,1:00k0k_3e296xxBfXi,1:00f0f_5QpRYaBnIK7,1:00e0e_aZTOihYtz9C,1:00c0c_iatoB70CmWg,1:00X0X_dwt0ZbxYJNC,1:00k0k_k3dPBZpN9KM,1:00W0W_f51jQcPO86R">
<span class="result-price">$1700</span>
</a>

We can reverse engineer this by extracting the ids and then formatting our own image urls:

# each data-ids attribute holds a comma-separated list of image ids
ids = response.xpath("//a[@class='result-image gallery']/@data-ids").extract()
# join with ',' so ids from adjacent <a> nodes do not run together
ids = ','.join(ids).split(',')
template = "https://images.craigslist.org/{}_300x300.jpg"
for img_id in ids:
    # e.g. 1:00G0G_anZn4IdI4pK
    # we want to get rid of the "1:" part
    img_id = img_id.split(':')[-1]
    url = template.format(img_id)
    print(url)
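
The same logic can be wrapped in a small helper that can be tried without a live response. This is only a sketch: the function name `build_image_urls` is hypothetical, and the URL template is the one guessed above from the page, not a documented Craigslist API.

```python
# URL pattern assumed in the answer above; not an official API
TEMPLATE = "https://images.craigslist.org/{}_300x300.jpg"


def build_image_urls(data_ids_values):
    """Turn a list of data-ids attribute values into image URLs."""
    urls = []
    for value in data_ids_values:
        for img_id in value.split(","):
            img_id = img_id.strip()
            if not img_id:
                continue  # skip empty fragments, e.g. from a trailing comma
            # "1:00G0G_anZn4IdI4pK" -> "00G0G_anZn4IdI4pK"
            urls.append(TEMPLATE.format(img_id.split(":")[-1]))
    return urls
```

In a spider callback it could be called as `build_image_urls(response.xpath("//a[@class='result-image gallery']/@data-ids").extract())`.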

Thanks for the response. I need to extract the .jpg hyperlink that is contained within the <img> node that is nested within the <a> node. – Keenan Burke-Pitts Apr 20 '17 at 14:09

score 0 · Accepted Answer · answered Apr 20 '17 at 14:24

Try the following XPath expression to get the image source links:

//div[@id="sortable-results"]//img/@src

Andersson

still returns an empty list when I use response.xpath('//div[@id="sortable-results"]//img/@src') – Keenan Burke-Pitts Apr 20 '17 at 14:32
This is because the required content is dynamic: it is generated by `JavaScript`... but the `XPath` is correct :) – Andersson Apr 20 '17 at 14:34
Thanks for the clarification! – Keenan Burke-Pitts Apr 20 '17 at 14:37
Welcome. You might need to check [this](http://stackoverflow.com/questions/30345623/scraping-dynamic-content-using-python-scrapy) – Andersson Apr 20 '17 at 14:39
Thanks! Also for anyone else stuck on this issue this may also be useful: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash – Keenan Burke-Pitts Apr 20 '17 at 14:43
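
To see why the `//img/@src` query comes back empty, it can help to look at the HTML the server sends before any JavaScript runs. The sketch below uses only Python's standard-library `html.parser`. The sample markup is a shortened, hypothetical stand-in for the `<a>` node quoted in the first answer, and the class name `StaticScan` is illustrative.

```python
from html.parser import HTMLParser

# Shortened stand-in for the server-sent markup
# (assumption: it contains no <img> tags until JavaScript runs)
RAW_HTML = """
<div id="sortable-results"><ul class="rows"><li class="result-row">
<a href="/cto/6095960745.html" class="result-image gallery"
   data-ids="1:01414_7WJQELsYuex,1:00t0t_kxF99J8uXmP">
<span class="result-price">$1700</span></a>
</li></ul></div>
"""


class StaticScan(HTMLParser):
    """Collect img src values and data-ids attributes from static HTML."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.data_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.img_srcs.append(attrs["src"])
        if tag == "a" and "data-ids" in attrs:
            self.data_ids.append(attrs["data-ids"])


scan = StaticScan()
scan.feed(RAW_HTML)
print(scan.img_srcs)  # no <img> tags in the raw markup
print(scan.data_ids)  # the ids are still present as an attribute
```

Comparing `response.text` in the Scrapy shell with what the browser's inspector shows gives the same kind of check on the real page. Rendering the page first (for example with Splash, as in the link above) is one way to get the generated `<img>` nodes.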
