Items vs item loaders in scrapy

Question

I'm pretty new to Scrapy. I know that items are used to hold scraped data, but I can't understand the difference between items and item loaders. I tried reading some example code; it used item loaders instead of items to store the data, and I can't understand why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when item loaders are used and what additional facilities they provide over items?

Granitosaurus · Accepted Answer · 2016-08-25T12:36:00.817

I really like the official explanation in the docs:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

The last paragraph should answer your question.
Item loaders are great because they give you many processing shortcuts and let you reuse a lot of code, which keeps everything tidy, clean and understandable.

Comparison example. Let's say we want to scrape this item:

from scrapy import Item, Field

class MyItem(Item):
    full_name = Field()
    bio = Field()
    age = Field()
    weight = Field()
    height = Field()

An item-only approach would look something like this:

def parse(self, response):
    item = MyItem()
    full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
    # e.g. returns ugly ['John\n', '\n\t  ', '  Snow']
    item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
    bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
    item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
    # extract_first(0) falls back to 0 when the node is missing
    age = response.xpath("//div[@class='age']/text()").extract_first(0)
    item['age'] = int(age)
    weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
    item['weight'] = int(weight)
    height = response.xpath("//div[@class='height']/text()").extract_first(0)
    item['height'] = int(height)
    return item

vs Item Loaders approach:

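# define once in items.py
from scrapy.loader import ItemLoader
# on newer Scrapy versions these processors live in itemloaders.processors
from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst
# strip each fragment, drop the empty ones (MapCompose ignores None), then join
clean_text = Compose(MapCompose(lambda v: v.strip() or None), Join())
# take the first value and convert it to int
to_int = Compose(TakeFirst(), int)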

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int

# parse as many different places and times as you want  
def parse(self, response):
    loader = MyItemLoader(selector=response)
    loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
    loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
    loader.add_xpath('age', "//div[@class='age']/text()")
    loader.add_xpath('weight', "//div[@class='weight']/text()")
    loader.add_xpath('height', "//div[@class='height']/text()")
    return loader.load_item()

As you can see, the Item Loader version is much cleaner and easier to scale. Say you have 20 more fields, many of which share the same processing logic: doing that without Item Loaders would be painful. Item Loaders are awesome and you should use them!
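To make the shared processing logic concrete, here is a rough plain-Python sketch of what the two processors are meant to do to the raw extracted lists. It does not use Scrapy at all: `clean_text`, `to_int`, `OUTPUT_PROCESSORS` and `load_item` are hypothetical stand-ins written for illustration, and the sample values are made up.

```python
# Plain-Python stand-ins for the two shared processors. These are NOT
# Scrapy's implementations, only an illustration of the intended behaviour.

def clean_text(values):
    """Strip each fragment, drop empty ones, and join the rest with spaces."""
    stripped = (v.strip() for v in values)
    return " ".join(v for v in stripped if v)


def to_int(values):
    """Take the first non-blank fragment and convert it to an integer."""
    for v in values:
        if v is not None and v.strip():
            return int(v.strip())
    raise ValueError("no value to convert")


# One processor per field: adding a field is a one-line change.
OUTPUT_PROCESSORS = {
    "full_name": clean_text,
    "bio": clean_text,
    "age": to_int,
    "weight": to_int,
    "height": to_int,
}


def load_item(raw):
    """Apply each field's processor to the raw values collected for it."""
    return {field: OUTPUT_PROCESSORS[field](values) for field, values in raw.items()}


# Hypothetical raw values, shaped like the lists that .extract() returns.
raw = {
    "full_name": ["John\n", "\n\t  ", "  Snow"],
    "bio": ["  Lord Commander ", "\n", "of the Night's Watch\n"],
    "age": ["23"],
    "weight": [" 80\n"],
    "height": ["180"],
}

item = load_item(raw)
print(item)
```

The point of the sketch is the mapping: the cleaning rules are written once and attached to fields by name, which is roughly the role the `*_out` attributes play on the loader class above.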

The item example creates a lot of unnecessary variables which makes it look a lot more cluttered, `item["bio"] = response.xpath("//div[contains(@class,'bio')]/text()").extract()` — Padraic Cunningham, Aug 24 '16 at 22:01
@PadraicCunningham I don't see any unnecessary variables here, since the `bio` field has to be stripped and joined. Your example would just store a list of values with no cleanup. — Granitosaurus, Aug 25 '16 at 07:08
I might be wrong, but I feel that `ItemLoader`'s added value lies in using small, shared (repetitive) functions across many projects; other than that I don't see why to use them — , Jul 17 '17 at 19:26
Thanks for this! Can I just confirm if you put the class `MyItemLoader` and the function `parse` in items.py? I thought that `parse` would go in the spider? — Maverick, Sep 28 '17 at 08:56
@Maverick parse goes in your spider. I was just saving space. — Granitosaurus, Sep 28 '17 at 09:29
Thanks for clarifying, I've been trying to get this straight in my mind and was a little thrown off by the thought that they might both be stored in items. One more thing I'm trying to understand: why does `default_item_class` equal `MyItem`? — Maverick, Sep 28 '17 at 09:39
@Maverick The ItemLoader needs a class for the items it generates. The `default_item_class` class variable specifies the default class that will be used; otherwise you'd need to supply the `item` argument every time you instantiate your loader. — Granitosaurus, Sep 28 '17 at 12:19
@Granitosaurus Can you please elaborate, "parse as many different places and times as you want " ? — yajant b, Jul 05 '19 at 09:36
