
Thanks to everyone in advance. I ran into a problem when using Scrapy on Python 2.7. The webpage I am trying to crawl is a discussion board for the Chinese stock market. When I try to get the first number, "42177", just under the banner of this page, I always get empty content. (The number you see on the webpage may differ from the one in the picture shown here, because it is the number of times the article has been read and it is updated in real time.) I am aware that this might be a dynamic content issue, but I don't yet have a clue how to crawl it properly.

42177 is the number I tried to crawl

The code I used is:

item["read"] = info.xpath("div[@id='zwmbti']/div[@id='zwmbtilr']/span[@class='tc1']/text()").extract()

I think the XPath is correct. I checked the response itself, and it confirms that there is nothing under this node. The result is: 'read': [u'<div id="zwmbtilr"></div>']

If it has something, there should be something between <div id="zwmbtilr"> and </div>.

I would really appreciate it if you could share any thoughts on this!

fbabelle

1 Answer


I just opened your link in Firefox with NoScript enabled. There is nothing inside <div id="zwmbtilr"></div>. If I enable JavaScript, I can see the content you want. So, as you already knew, it is a dynamic content issue.
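The same check can be done from a script instead of toggling NoScript: look at the HTML the server actually sends and see whether the target div is empty. The following is a minimal sketch using only the standard library. The `text_inside` helper is a hypothetical name, and the "rendered" sample string is an assumed shape inferred from the XPath in the question, not a capture of the real page.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class TextInsideId(HTMLParser):
    """Collect the text nested inside the first element with a given id."""

    # Tags that never have a closing tag, so they must not change the depth.
    VOID = {"br", "img", "input", "hr", "meta", "link"}

    def __init__(self, target_id):
        HTMLParser.__init__(self)
        self.target_id = target_id
        self.depth = 0        # > 0 while we are inside the target element
        self.found = False    # only the first matching element is used
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in self.VOID:
                self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_inside(html, target_id):
    """Return the stripped text found inside the element with id=target_id."""
    parser = TextInsideId(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# What the server sends (matches the empty div reported in the question).
raw = '<div id="zwmbti"><div id="zwmbtilr"></div></div>'
# Assumed shape after JavaScript has run, inferred from the question's XPath.
rendered = ('<div id="zwmbti"><div id="zwmbtilr">'
            '<span class="tc1">42177</span></div></div>')

print(repr(text_inside(raw, "zwmbtilr")))
print(repr(text_inside(rendered, "zwmbtilr")))
```

If the helper returns an empty string for the body Scrapy downloaded, the number is being filled in later by a script, and no XPath against that response will find it.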

Your first option is to try to identify the request generated by the JavaScript. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use a package that provides JavaScript or browser emulation, such as ScrapyJS or Scrapy + Selenium.
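To illustrate the first option, here is a rough sketch of the two pieces you would need once the request has been identified: building its URL and pulling the number out of the response body. Everything specific here is an assumption: the endpoint in `COUNT_URL_TEMPLATE`, the `count` key, and the idea that the reply is JSON or JSONP are placeholders, since the real request has to be read from the browser's network panel.

```python
import json
import re

# Hypothetical endpoint: replace it with the URL seen in the network panel.
COUNT_URL_TEMPLATE = "http://example.com/api/read_count?id={article_id}"

# Matches a JSONP wrapper such as `callback({...});` and captures the payload.
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.S)


def build_count_url(article_id):
    """Build the URL of the (assumed) request that returns the read count."""
    return COUNT_URL_TEMPLATE.format(article_id=article_id)


def parse_read_count(body, key="count"):
    """Extract an integer from a JSON or JSONP body.

    `key` is a hypothetical field name; use whatever the real response contains.
    """
    match = _JSONP_RE.match(body)
    payload = match.group(1) if match else body
    data = json.loads(payload)
    return int(data[key])


# Example with made-up response bodies in both formats.
print(parse_read_count('{"count": 42177}'))
print(parse_read_count('callback({"count": "42177"});'))
```

In a spider, the callback that parses the article page would yield `scrapy.Request(build_count_url(article_id), callback=...)`, and the second callback would call `parse_read_count(response.body_as_unicode())` (or `response.text` on newer Scrapy versions) to fill `item["read"]`.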

Djunzu
  • Thanks Djunzu! Can you give a brief explanation of, or some links on, your first suggested option? I would love to learn the technique, since to be frank this is not the first time I have run into this problem. Thanks! – fbabelle Apr 24 '16 at 11:47
  • I have never had to deal with dynamic content, so I have no previous experience with it. But I would start by inspecting the requests the browser makes with and without JavaScript enabled (on Firefox you can use Firebug + NoScript, or the equivalent in another browser). Also inspect the JavaScript source itself. If it is a simple case, you will find out how to recreate the needed request. Maybe this can help: http://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax – Djunzu Apr 24 '16 at 20:41