
I'm starting to work with item loaders in Scrapy, and the basic functionality is working fine, as in:

l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')

But if I want to apply a function to this item, where do I define the function?

Another question has this example:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

clean_text = Compose(MapCompose(lambda v: v.strip()), Join())
to_int = Compose(TakeFirst(), int)

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int

Does this go in place of the default item template?

import scrapy


class MoocsItem(scrapy.Item):
    # define the fields for your item here like:
    description = scrapy.Field()
    course_title = scrapy.Field()

Can I use one-liner functions such as this one?

clean_text = Compose(MapCompose(lambda v: v.strip()), Join())
Luis Ramon Ramirez Rodriguez
Comment: You assign it to the loader you created; see https://stackoverflow.com/questions/46619150/scrapy-item-loader-default-processors/46619196#46619196 – Tarun Lalwani Apr 24 '18 at 10:02

1 Answer


There are two ways to attach a processor such as clean_text to a field.

Approach 1

You can declare the processor on the field in your Item class:

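import scrapy
from scrapy.loader.processors import Compose, MapCompose, Join

clean_text = Compose(MapCompose(lambda v: v.strip()), Join())

class MoocsItem(scrapy.Item):
    description = scrapy.Field()
    # clean_text runs when load_item() builds the final course_title value
    course_title = scrapy.Field(output_processor=clean_text)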

Then use it like this:

from scrapy.loader import ItemLoader
l = ItemLoader(item=MoocsItem(), response=response)
l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')

item = l.load_item()

This code would, of course, go inside a spider callback, where response is available.
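On the question of where the function is defined: a processor is just a callable that receives the list of collected values, so an ordinary module-level function works as well as a one-line Compose chain. Here is a minimal sketch in plain Python, with no Scrapy import so it runs on its own. The name strip_and_join and the sample values are hypothetical, and unlike the clean_text chain above it also drops empty strings before joining.

```python
def strip_and_join(values):
    """Strip each extracted string, drop empty ones, join with one space."""
    stripped = (v.strip() for v in values)
    return " ".join(s for s in stripped if s)


# Hypothetical values, shaped like what a //text() XPath tends to return
# for a title that is split across nested elements.
extracted = ["\n   Intro to ", "Python", "  \n"]
course_title = strip_and_join(extracted)
```

A function like this could be passed as scrapy.Field(output_processor=strip_and_join) in place of clean_text.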

Approach 2

Another way is to create your own loader:

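from scrapy.loader import ItemLoader

class MoocsItemLoader(ItemLoader):
    default_item_class = MoocsItem
    # the attribute must be named <field name>_out to match the course_title field
    course_title_out = clean_text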

Then use the loader in a callback like this:

from scrapy.loader import ItemLoader
l = MoocsItemLoader(response=response)
l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')

item = l.load_item()

As you can see, with this approach you don't need to pass in an item instance, because the loader creates one from default_item_class.
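The two approaches can be combined. According to the Scrapy documentation, a loader attribute named after the field with an _out suffix takes precedence, then the output_processor declared on the Field, then the loader's default output processor. The sketch below is a simplified stand-in for that lookup order and is not Scrapy's actual implementation; TinyLoader, resolve_output_processor, join_stripped and identity are hypothetical names used only for illustration.

```python
def join_stripped(values):
    """Strip every value and join the results with a single space."""
    return " ".join(v.strip() for v in values)


def identity(values):
    """Return the collected values unchanged."""
    return values


class TinyLoader:
    """Toy loader that only resolves which output processor a field gets."""

    default_output_processor = identity
    # Stands in for metadata such as scrapy.Field(output_processor=...).
    field_metadata = {}

    def resolve_output_processor(self, field_name):
        cls = type(self)
        # 1. A loader attribute named "<field name>_out" wins.
        proc = getattr(cls, f"{field_name}_out", None)
        # 2. Otherwise fall back to the processor declared on the field.
        if proc is None:
            proc = self.field_metadata.get(field_name, {}).get("output_processor")
        # 3. Otherwise use the loader-wide default.
        if proc is None:
            proc = cls.default_output_processor
        return proc


class MoocsLoader(TinyLoader):
    course_title_out = join_stripped


loader = MoocsLoader()
title = loader.resolve_output_processor("course_title")(["  Intro to ", "Python "])
description = loader.resolve_output_processor("description")(["raw text"])
```

Because the lookup is by name, an attribute whose prefix does not exactly match a field name is never found, and that field silently falls through to the next level.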

Tarun Lalwani