I have two kinds of Item:

class MovieItem(scrapy.Item):
    id = scrapy.Field()
    image_urls=scrapy.Field()
    image_paths =scrapy.Field()
    torrents = scrapy.Field()
    #...other fields


class TorrentItem(scrapy.Item):
    id = scrapy.Field()
    movie_id = scrapy.Field()
    image_urls=scrapy.Field()
    image_paths =scrapy.Field()

I want to use ImagesPipeline and FilesPipeline to download the images and torrents for a movie. How should I yield the two kinds of item in the *parse* method, and how should I define the corresponding pipelines?

Brian Tompsett - 汤莱恩
  • In Scrapy 1.0 you can return Python dictionary objects. Shameless plug: you can read about the differences between Scrapy 0.24 and 1.0 on my [blog](http://kirankoduru.github.io/python/scrapy-1.0-release.html) –  Aug 30 '16 at 15:19

1 Answer

Yes, you can yield both kinds of item from the same parse method. Here is an example.py spider that does it:

# -*- coding: utf-8 -*-
import scrapy


class MovieItem(scrapy.Item):
    id = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    torrents = scrapy.Field()
    itemtype = scrapy.Field()


class TorrentItem(scrapy.Item):
    id = scrapy.Field()
    movie_id = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        image_urls = [
            "http://...-Miles.jpg",
            "https:/.../58832_300x300",
            "http://...-Circuit-Tests.png"
        ]

        # Yield one TorrentItem per image URL and remember its id.
        torrent_ids = []
        for i in range(3):
            t = TorrentItem()
            t["id"] = "#id%d" % i
            t["movie_id"] = 143
            t["image_urls"] = [image_urls[i]]
            # ...
            torrent_ids.append(t["id"])
            yield t

        # Yield the MovieItem last, linking to the torrents by id.
        m = MovieItem()
        m['id'] = 143
        m['image_urls'] = ['http://...test.png']
        m['torrents'] = torrent_ids
        m['itemtype'] = ['movie']
        # ...
        yield m

In your settings.py, add the following two lines:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '.'
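
The question also asks about downloading the torrent files themselves. A minimal sketch of the settings with both built-in media pipelines enabled is shown here. It assumes the items that carry torrent links declare `file_urls` and `files` fields, which are the field names FilesPipeline uses by default; neither field exists in the items above, so they would need to be added.

```python
# settings.py (sketch): enable both built-in media pipelines.
# Lower numbers run first; the exact values here are arbitrary.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
}
IMAGES_STORE = '.'  # directory where downloaded images are written
FILES_STORE = '.'   # directory where downloaded files are written
```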

Run the spider:

scrapy crawl example -o test.jl

Your test.jl file will contain the following (reformatted here for readability):

{
    "images": [
        {
            "url": "http://.../Stuart-Miles.jpg",
            "path": "full/27c8d5099f8785e8fbc2370249a0260e216ee2dd.jpg",
            "checksum": "dba2fc121610b328448dc37084f31dac"
        }
    ],
    "movie_id": 143,
    "id": "#id0",
    "image_urls": [
        "http://...ter-Key-by-Stuart-Miles.jpg"
    ]
}
{
    "images": [
        {
            "url": "https://i....t/58832_300x300",
            "path": "full/b11276eb5b64b5ec7f40eedf4c6fcc6d6d9072ac.jpg",
            "checksum": "a9b47ecbb2de9dcb6a61a159120f1bd2"
        }
    ],
    "movie_id": 143,
    "id": "#id1",
    "image_urls": [
        "https://i.vi..._300x300"
    ]
}
{
    "images": [
        {
            "url": "http://www.ej...rt-Circuit-Tests.png",
            "path": "full/a68282eb533d35a0aa8732a872277933db8951c5.jpg",
            "checksum": "24c0907e3ef610dc355e930f2535c0c4"
        }
    ],
    "movie_id": 143,
    "id": "#id2",
    "image_urls": [
        "http://www.ejob...nsformer-Open-and-Short-Circuit-Tests.png"
    ]
}
{
    "images": [
        {
            "url": "http://...est.png",
            "path": "full/1e3e0f775cd40aaa5ea081278957f4d49e39f610.jpg",
            "checksum": "50a57a6263b9640ee47e913deadaff7c"
        }
    ],
    "torrents": [
        "#id0",
        "#id1",
        "#id2"
    ],
    "itemtype": [
        "movie"
    ],
    "image_urls": [
        "http://xi.../10/test.png"
    ],
    "id": 143
}

This works nicely with .jl (JSON Lines) output, where each line is an independent record. It won't work well with .csv, since the two item types have different fields, but this shouldn't be a problem in your case.
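
Because every line of a .jl file is a standalone JSON object, the two item types can be separated again when the file is read back. The following is only a sketch: the helper name `split_items` is hypothetical, the sample records are trimmed versions of the output above, and it assumes that only torrent items carry a `movie_id` key.

```python
import io
import json


def split_items(lines):
    """Split JSON Lines records into (movies, torrents).

    Assumption: a record is a torrent if it has a "movie_id" key.
    """
    movies, torrents = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "movie_id" in record:
            torrents.append(record)
        else:
            movies.append(record)
    return movies, torrents


# In-memory stand-in for open("test.jl"), using trimmed records.
sample = io.StringIO(
    '{"id": "#id0", "movie_id": 143}\n'
    '{"id": "#id1", "movie_id": 143}\n'
    '{"id": 143, "torrents": ["#id0", "#id1"], "itemtype": ["movie"]}\n'
)
movies, torrents = split_items(sample)
```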

neverlastn
  • I have three pipelines: `InsertDBPipeline`, which inserts movie information into a database; `ImageDownloadPipeline`, which downloads the images for a movie; and `TorrentsDownloadPipeline`, which receives a `TorrentItem` instance and downloads it. As you say, I would yield two kinds of item, but how can the pipelines treat them separately? In other words, how do I make `TorrentsDownloadPipeline` handle `TorrentItem` while `InsertDBPipeline` and `ImageDownloadPipeline` handle `MovieItem`? – Yancheng Zeng Aug 31 '16 at 04:57
  • You can check the type of the object in your pipelines: http://stackoverflow.com/questions/2225038/determine-the-type-of-an-object – neverlastn Aug 31 '16 at 06:33
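
The type check suggested in the last comment could look roughly like this. It is a sketch, not the commenter's code: the two item classes are plain `dict` stand-ins for the `scrapy.Item` subclasses so the snippet runs without Scrapy installed, and the `rows` and `queued` lists are hypothetical placeholders for a real database insert and a real download. Only the `process_item(self, item, spider)` method signature follows Scrapy's item pipeline interface.

```python
class MovieItem(dict):
    """Stand-in for the scrapy.Item subclass of the same name."""


class TorrentItem(dict):
    """Stand-in for the scrapy.Item subclass of the same name."""


class InsertDBPipeline:
    def __init__(self):
        self.rows = []  # hypothetical in-memory "database"

    def process_item(self, item, spider):
        # Act only on movies; pass everything else through untouched.
        if isinstance(item, MovieItem):
            self.rows.append(dict(item))
        return item  # always return the item so later pipelines see it


class TorrentsDownloadPipeline:
    def __init__(self):
        self.queued = []  # hypothetical download queue

    def process_item(self, item, spider):
        # Act only on torrents; movies are ignored here.
        if isinstance(item, TorrentItem):
            self.queued.append(item["id"])
        return item


# Simulate Scrapy passing each yielded item through both pipelines.
db, downloader = InsertDBPipeline(), TorrentsDownloadPipeline()
items = [TorrentItem(id="#id0", movie_id=143), MovieItem(id=143)]
for item in items:
    for pipeline in (db, downloader):
        item = pipeline.process_item(item, spider=None)
```

Each pipeline returns the item whether or not it handled it, so every enabled pipeline still receives both item types and decides for itself what to do.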