I want to scrape all the images of the items (iPhones) on this web page. First I extract the `src` URL of each image, then I request those URLs one by one and save each image to the folder 'phone/'. Here is my code:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        print 'hi'
        self.crawl('https://s.taobao.com/search?q=iphone&imgfile=&ie=utf8', callback=self.index_page, fetch_type='js')

    #@config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        items = response.doc('.item').items()
        for item in items:
            imgurl = item('.J_ItemPic img').attr('.src')
            if imgurl:
                filename = item('.J_ItemPic.img').attr('.id')
                self.crawl(imgurl, callback=self.scrape_photo, save={'filename': filename})

    def save_photo(self, content, filename):
        with open('phone/'+filename, 'wb') as f:
            f.write(content)

    def scrape_photo(self, response):
        content = response.content
        filename = response.save['filename']+'.jpg'
        self.save_photo(content, filename)
```

It seems simple and intuitive, but when I run the code nothing happens, and I only get these log messages in the terminal:

```
[I 160602 18:57:42 scheduler:664] restart task sk:on_start data:,on_start
[I 160602 18:57:42 scheduler:771] select sk:on_start data:,on_start
[I 160602 18:57:42 tornado_fetcher:178] [200] sk:on_start data:,on_start 0s
[I 160602 18:57:42 processor:199] process sk:on_start data:,on_start -> [200] len:8 -> result:None fol:1 msg:0 err:None
[I 160602 18:57:42 scheduler:712] task done sk:on_start data:,on_start
```

This issue is driving me crazy. Could you please tell me what the problem is and how I can fix it? Thanks in advance!

u3728666
  • It's not easy to scrape `taobao.com`. They have a team dedicated to blocking scraping bots. – Sayakiss Jun 02 '16 at 11:36
  • @Sayakiss But there are lots of tutorials about scraping taobao, though they are not what I want... – u3728666 Jun 02 '16 at 11:46
  • You could try to reproduce your problem on a simple website (for example, run an HTTP server on localhost and scrape it). Also, if a request to `taobao.com` does not come from the PRC, it is redirected to `world.taobao.com`... I'm afraid it will be hard to reproduce your problem and locate it easily... – Sayakiss Jun 02 '16 at 14:43
  • @Sayakiss Thanks for your reply. I will do that. But have you tried my code? I think I have written something wrong, but I can't find it. – u3728666 Jun 02 '16 at 14:59

1 Answer

Have you ever crawled the link 'https://s.taobao.com/search?q=iphone&imgfile=&ie=utf8' before?

By default, pyspider discards links that have already been crawled. With @config(age=10 * 24 * 60 * 60) commented out, the default `age` applies, which means never recrawl.
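To make the `age` rule concrete, here is a toy model of the decision in plain Python. It is not pyspider's scheduler code; the function name `should_recrawl` and its parameters are made up for illustration, and the real scheduler also considers other options. It only shows why a URL that was crawled once is skipped when `age` is left at its documented default of -1.

```python
import time

NEVER_RECRAWL = -1  # documented default for `age` in pyspider


def should_recrawl(last_crawl_time, age=NEVER_RECRAWL, now=None):
    """Toy model: decide whether an already-known URL is fetched again.

    last_crawl_time: Unix timestamp of the previous crawl, or None if the
                     URL has never been crawled.
    age:             validity period in seconds; a negative value means the
                     page is considered valid forever.
    """
    if now is None:
        now = time.time()
    if last_crawl_time is None:
        return True   # first time we see this URL
    if age < 0:
        return False  # crawled before and never expires
    return last_crawl_time + age < now  # expired only once `age` has passed


# The search page was crawled once at t=50 and no age is configured:
print(should_recrawl(50, now=100))          # skipped
# Same page, but with a 10 second validity period:
print(should_recrawl(50, age=10, now=100))  # crawled again
```

In this model the first call returns `False`, which matches the symptom in the question: `on_start` runs, emits one follow-up task, and that task is silently dropped.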

If you want to restart the whole project, `itag` will help: http://docs.pyspider.org/en/latest/apis/self.crawl/#itag
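As a rough sketch of what that could look like in the handler from the question: setting `itag` in `crawl_config` and changing its value after editing the script marks existing tasks for recrawling, and re-enabling the `age` setting gives the index page a finite validity period. The value `'v2'` is an arbitrary placeholder, and the body of `index_page` is left out.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
        # Change this value after modifying the script; tasks whose stored
        # itag differs from the new one are crawled again.
        'itag': 'v2',  # arbitrary placeholder label
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://s.taobao.com/search?q=iphone&imgfile=&ie=utf8',
                   callback=self.index_page, fetch_type='js')

    @config(age=10 * 24 * 60 * 60)  # treat the page as unchanged for 10 days
    def index_page(self, response):
        pass  # image extraction from the question goes here
```

After changing `itag`, run the project again so that `on_start` re-submits the search URL.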

Binux