scrapy xpath select elements by classname

Question

I followed "How can I find an element by CSS class with XPath?", which gives the selector for selecting elements by class name. The problem is that when I use it, it returns an empty result ("[]"), even though I know for a fact that there is a div with class "zoomWindow" at the URL fed to the scrapy shell.

[Screenshot showing the class on the web page]

My attempt:

scrapy shell "http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/?ObjectPath=/Shops/NICShop/Products/5696"
response.xpath("//*[contains(@class, 'zoomWindow')]")

I have looked at many resources that provide varied selectors. In my case the element has only one class, so I also tried the versions that use "concat", but they didn't work either and I discarded them.

I installed Ubuntu and Scrapy in a virtual machine just to make sure it was not a bug in my installation on Windows, but my attempt on Ubuntu gave the same results.

I don't know what else to try. Can you see any typo in the selector?

alecxe · Accepted Answer · 2015-01-27T16:28:26.573

If you check response.body in the shell, you will see that it doesn't contain an element with class="zoomWindow":

In [3]: "zoomWindow" in response.body
Out[3]: False

But if you open the page in the browser and inspect the rendered HTML, you will see that the element is there. This means that the page load involves JavaScript logic or additional AJAX requests. Scrapy is not a browser and doesn't have a built-in JavaScript engine. In other words, it only downloads the initial HTML code of the page, without additionally downloading the JS and CSS files and "executing" them.
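The same check can be reproduced offline with the standard library only. This is a minimal sketch, assuming a made-up page in which a script would add the zoom div at runtime; `STATIC_HTML` and `ClassCollector` are illustrative names, not part of Scrapy or of the page in the question.

```python
from html.parser import HTMLParser

# Hypothetical static HTML, as a downloader without a JavaScript engine sees it.
# The zoom div is absent from the markup; a script would create it in the browser.
STATIC_HTML = """
<html><body>
  <img id="PreviewImage" src="/images/product.png">
  <script src="/js/zoom.js"></script>
</body></html>
"""


class ClassCollector(HTMLParser):
    """Collect every class token that appears on an element in the markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


collector = ClassCollector()
collector.feed(STATIC_HTML)

# No element in the downloaded markup carries the class, so any selector
# looking for it returns nothing.
has_zoom_window = "zoomWindow" in collector.classes
print(has_zoom_window)  # False
```

If a class is missing from the downloaded markup like this, no XPath or CSS selector can match it, however the selector is written.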

What you can try, for starters, is the scrapyjs (since renamed scrapy-splash) download handler and middleware.

The image you want to extract is also available in the img tag with id="PreviewImage":

In [4]: response.xpath("//img[@id='PreviewImage']/@src").extract()
Out[4]: [u'/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png']
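The extracted `src` is relative to the site root, so it has to be joined with the page URL before the image can be downloaded. Here is a small sketch using only the standard library; `absolute_image_url` is a hypothetical helper name, and the two constants are copied from the question and the output above. Recent Scrapy versions also provide `response.urljoin(src)`, which should give the same result inside a spider.

```python
from urllib.parse import urljoin

# Page URL from the question and the relative src extracted above.
PAGE_URL = (
    "http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/"
    "?ObjectPath=/Shops/NICShop/Products/5696"
)
RELATIVE_SRC = (
    "/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png"
)


def absolute_image_url(page_url, src):
    """Resolve a possibly relative src attribute against the page URL."""
    return urljoin(page_url, src)


image_url = absolute_image_url(PAGE_URL, RELATIVE_SRC)
print(image_url)
```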

edited Jan 27 '15 at 16:28

answered Jan 27 '15 at 16:12

Does that mean that after running scrapy shell "url", not all of the web page content is available in response.body? See the screenshot for where the class appears on the web page. – secuaz Jan 27 '15 at 16:18
@secuaz you can say so, yes. I've updated the answer extending the explanation a bit. Btw, why do you need this element, what kind of data do you want to get from this element? – alecxe Jan 27 '15 at 16:21
I need the image, which is applied to the target element via the CSS background-image property. – secuaz Jan 27 '15 at 16:24
@secuaz updated the answer, is this what you want to extract? Thanks. – alecxe Jan 27 '15 at 16:28
Yes!! But how did you know that id? I thought that if an image was added via css that wouldn't generate an img element. – secuaz Jan 27 '15 at 16:34
@secuaz I've just dumped the `response.body` to a local HTML file and searched for the background image path used inside the `div` with `class="zoomWindow"` - got lucky to have it inside a separate `img` tag. Hope that makes sense. – alecxe Jan 27 '15 at 16:37
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/69679/discussion-between-secuaz-and-alecxe). – secuaz Jan 27 '15 at 16:41
To dump response.body to a local HTML file, use: `with open('index.html', 'wb') as f: f.write(response.body)` – secuaz Jan 27 '15 at 16:49
