
I'm starting out with Scrapy in order to automate downloading files from websites. As a test, I want to download the jpg files from the website shown in the spider code below. My code is based on the intro tutorial and the Files and Images Pipeline tutorial in the Scrapy documentation.

My code is as follows.

In settings.py, I have added these lines:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

IMAGES_STORE = '/home/lucho/Scrapy/jpg/'

My items.py file is:

import scrapy

class JpgItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

My pipeline file is:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class JpgPipeline(object):
    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

Finally, my spider file is:

import scrapy
from ..items import JpgItem

class JpgSpider(scrapy.Spider):
    name = "jpg"
    allowed_domains = ["http://www.kevinsmedia.com"]
    start_urls = [
        "http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"
    ]

def init_request(self):
    #"""This function is called before crawling starts."""
    return Request(url=self.login_page, callback=self.parse)

def parse(self, response):
    item = JpgItem()
    return item

(Ideally, I want to download all the jpg files without specifying the exact web address of each file.)

The output of "scrapy crawl jpg" is:

2015-12-08 19:19:30 [scrapy] INFO: Scrapy 1.0.3.post6+g2d688cd started (bot: jpg)
2015-12-08 19:19:30 [scrapy] INFO: Optional features available: ssl, http11
2015-12-08 19:19:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jpg.spiders', 'SPIDER_MODULES': ['jpg.spiders'], 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 3, 'BOT_NAME': 'jpg'}
2015-12-08 19:19:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-08 19:19:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-08 19:19:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-08 19:19:30 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2015-12-08 19:19:30 [scrapy] INFO: Spider opened
2015-12-08 19:19:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-08 19:19:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-08 19:19:31 [scrapy] DEBUG: Crawled (200) <GET http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/> (referer: None)
2015-12-08 19:19:31 [scrapy] DEBUG: Scraped from <200 http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/>
{'images': []}
2015-12-08 19:19:31 [scrapy] INFO: Closing spider (finished)
2015-12-08 19:19:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 254,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2975,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 8, 22, 19, 31, 294139),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 12, 8, 22, 19, 30, 619918)}
2015-12-08 19:19:31 [scrapy] INFO: Spider closed (finished)

Although no error is reported, the program does not retrieve the jpg files. In case it matters, I'm using Ubuntu.

luchonacho

1 Answer


You haven't defined parse() inside your JpgSpider class: as posted, init_request() and parse() are indented at module level, so Scrapy never calls them.

Update: Now that I can see the URL in your update, this doesn't look like a problem you should be attacking with Scrapy. wget might be more appropriate; have a look at the answers here. In particular, look at the first comment on the top answer to see how to use the file extension to limit which files you download (-A jpg).
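For reference, a recursive wget along those lines might look like this (a sketch; the exact flag set is my suggestion, not taken from the linked answer):

wget -r -np -nd -A jpg http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/

Here -r recurses through the directory listing, -np stops wget from climbing to the parent directory, -nd keeps everything in one flat directory, and -A jpg restricts the download to jpg files.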

Update 2: The parse() routine can get the album-art URLs from the <a> tags using this code:

part_urls = response.xpath('//a[contains(., "AlbumArt")]/@href')

This returns a list of partial URLs; you will need to resolve them against the root URL of the page you are parsing, which is available as response.url. There are a few % codes in the URLs I've looked at; they may be a problem, but try it anyway. Once you have a list of full URLs, put them into the item:

item['image_urls'] = full_urls
yield item
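Putting those pieces together, a minimal version of the spider might look like this. This is only a sketch: it assumes the default ImagesPipeline is still enabled in settings.py as in the question, and it has not been tested against the live page.

import scrapy
from ..items import JpgItem

class JpgSpider(scrapy.Spider):
    name = "jpg"
    allowed_domains = ["kevinsmedia.com"]  # domain only, no scheme
    start_urls = ["http://www.kevinsmedia.com/km/mp3z/Fluke/Risotto/"]

    def parse(self, response):
        # Relative hrefs of the album-art links on the directory listing
        part_urls = response.xpath('//a[contains(., "AlbumArt")]/@href').extract()
        item = JpgItem()
        # response.urljoin() resolves each partial URL against response.url
        item['image_urls'] = [response.urljoin(u) for u in part_urls]
        yield item

If this works, the ImagesPipeline downloads every URL in image_urls and saves the files under IMAGES_STORE in a full/ subdirectory, named by a SHA1 hash of each URL.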

This should get Scrapy to download the images automatically, so you can remove your custom pipeline and let Scrapy do the heavy lifting.

Steve