
I've been working with Scrapy but have run into a bit of a problem.

DjangoItem has a save method to persist items using the Django ORM. This is great, except that if I run a scraper multiple times, new items will be created in the database even though I may just want to update a previous value.

After looking at the documentation and source code, I don't see any means to update existing items.

I know that I could call out to the ORM to see if an item exists and update it, but it would mean calling out to the database for every single object and then again to save the item.

How can I update items if they already exist?

NT3RP
  • If your application needs to check whether an object already exists, it has to check against the objects already created. If your db structure can support a unique column, you could attempt the write and then, if there is an IntegrityError against the unique key, update instead (see the sketch after these comments) – dm03514 May 14 '14 at 19:50
  • If no one else writes to your database, you could query the database at startup (in `__init__` of your spider, or by catching `spider_opened` in some middleware, for example) and keep a list of IDs or a list of tuples representing your db items. Then when you have an item to save or update, you check against the list to know which operation to perform – paul trmbrth May 15 '14 at 09:52
  • paultrmbrth: That sounds like it might work. I had been looking at [this bit](http://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter) on detecting duplicates in the pipeline, but unfortunately it only covers duplicates from within the current scrape. dm03514: That could work as well, but the challenge then is querying the database for duplicates. – NT3RP May 15 '14 at 13:08
  • I just want to learn how to do it, i.e. how to run spider(s) at some interval and decide whether I should insert or update. But having read the comments above, I think I will add a field to the database for an md5 hash. Then for a scraped row I will split its values into two categories: crucial and non-crucial. From the crucial values I will calculate the md5 and decide whether I will update the existing row or whether I must/can insert a new row (and later delete the non-affected rows). – mirek Sep 08 '21 at 12:55
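
A minimal sketch of that write-then-update idea, assuming a hypothetical model with a unique `name` column (adapt the field to whatever your schema can make unique):

from django.db import IntegrityError, transaction

class WriteThenUpdatePipeline(object):
    def process_item(self, item, spider):
        try:
            # Attempt the insert; the unique constraint rejects duplicates.
            with transaction.atomic():
                item.save()
        except IntegrityError:
            # The row already exists, so update it in place instead.
            item.django_model.objects.filter(name=item['name']).update(**dict(item))
        return item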

3 Answers


Unfortunately, the best way that I found to accomplish this is to do exactly what was stated: check if the item exists in the database using `django_model.objects.get`, then update it if it does.

In my settings file, I added the new pipeline:

ITEM_PIPELINES = {
    # ...
    # Last pipeline, because further changes won't be saved.
    'apps.scrapy.pipelines.ItemPersistencePipeline': 999
}

I created some helper methods to handle the work of converting the item to its model, fetching the existing model if there is one, and updating it:

from django.forms.models import model_to_dict


def item_to_model(item):
    # The default of `None` matters here: without it, a plain scrapy `Item`
    # would raise AttributeError instead of the TypeError below.
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

    return item.instance


def get_or_create(model):
    model_class = type(model)
    created = False

    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(name=model.name)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.

    return (obj, created)


def update_model(destination, source, commit=True):
    pk = destination.pk

    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()

    return destination

Then, the final pipeline is fairly straightforward:

class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            return item

        model, created = get_or_create(item_model)

        update_model(model, item_model)

        return item
NT3RP
  • Hi NT3RP and others. It looks good. In my case it will be a little more difficult because I want to normalize the scraped values into more related tables. Ever since I started to use scrapy-djangoitem and DjangoItem, though, I have not been sure what DjangoItem is for. Is it something like ModelSerializer in Django REST Framework? What is its real benefit? Wouldn't it be better for me to work directly with Django ORM instances? – mirek Sep 08 '21 at 13:07

I think it could be done more simply with:

class DjangoSavePipeline(object):
    def process_item(self, item, spider):
        try:
            product = Product.objects.get(myunique_id=item['myunique_id'])
            # The row already exists: reuse its primary key so the save
            # below performs an UPDATE. DjangoItem caches the instance
            # returned by save(commit=False), so mutating it here is enough.
            instance = item.save(commit=False)
            instance.pk = product.pk
        except Product.DoesNotExist:
            # New record; fall through and let save() insert it.
            pass
        item.save()
        return item

This assumes your Django model has some unique identifier from the scraped data, such as a product ID, and that the Django model is called Product.
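
On Django 1.7 or newer, `QuerySet.update_or_create` can collapse the lookup and the write into a single call; a sketch under the same assumptions (a Product model with a unique myunique_id field):

class DjangoUpdateOrCreatePipeline(object):
    def process_item(self, item, spider):
        data = dict(item)
        # Look the row up by the unique id and apply the remaining fields
        # as defaults, creating the row if the lookup finds nothing.
        Product.objects.update_or_create(
            myunique_id=data.pop('myunique_id'),
            defaults=data,
        )
        return item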

fpghost

For related models with foreign keys, copy attribute values straight from the source instance; `model_to_dict` would hand back raw primary keys, which can't be assigned to a forward relation:

from django.forms.models import fields_for_model


def update_model(destination, source, commit=True):
    pk = destination.pk

    # fields_for_model yields the model's editable fields by name;
    # getattr on the source returns related instances, not raw pks.
    source_fields = fields_for_model(source)
    for key in source_fields.keys():
        setattr(destination, key, getattr(source, key))

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()

    return destination
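
A minimal usage sketch, assuming a hypothetical Book model with a ForeignKey field and a unique isbn field:

scraped = item.instance                  # unsaved model built from the DjangoItem
try:
    existing = Book.objects.get(isbn=scraped.isbn)
except Book.DoesNotExist:
    scraped.save()                       # nothing to update; insert it
else:
    update_model(existing, scraped)      # copy fields (FKs included), keep pk, save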
  • Will you show a more detailed example how this `update_model()` function can be used in the context of a `DjangoItem` in Scrapy? – Code-Apprentice Jun 16 '21 at 15:27