
Here is what I have in the Scrapy spider:

phone_model = SmartphoneItem()
phone_model['sku_number'] = sku_number

# code omitted

cellular_network = CellularNetworkItem()
cellular_network['phone_model'] = phone_model
cellular_network['speed'] = speed
...

in models.py:

class CellularNetwork(models.Model): 
    phone_model = models.ForeignKey('Smartphone', unique=True)
    ...

class Smartphone(models.Model):
    sku_number = models.IntegerField(primary_key=True)  # max_length is ignored on IntegerField
    ...

and in items.py:

class CellularNetworkItem(DjangoItem):
    django_model = CellularNetwork

class SmartphoneItem(DjangoItem):
    django_model = Smartphone

But assigning a SmartphoneItem to phone_model obviously does not give the foreign key a Smartphone model instance.

I am scraping a bunch of specifications and would prefer to validate the data at the source. Since the data has to be normalized for validation anyway, I'd rather kill two birds with one stone and update the database immediately.

The relational power of the ORM over plain Scrapy items seems to be what sells DjangoItem, but I can't seem to find any examples of exploiting this feature directly in the spider. I've seen some that use pipelines, handling objects case by case with isinstance checks... I'm starting to wonder if I might be better off just importing the Django models directly into the spider, along the lines of the sketch below.
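
For reference, the direct-import alternative would look roughly like this. It's only a sketch: the app path, spider name and selectors are made up, and it assumes Django's settings are configured (django.setup() has run) before the spider starts.

import scrapy
from myapp.models import Smartphone, CellularNetwork  # hypothetical app path

class PhoneSpecsSpider(scrapy.Spider):
    name = 'phone_specs'
    start_urls = ['http://example.com/phones']

    def parse(self, response):
        # placeholder selectors -- real extraction omitted
        sku_number = response.xpath('//span[@class="sku"]/text()').extract_first()
        speed = response.xpath('//span[@class="speed"]/text()').extract_first()

        # write straight to the ORM, no items or pipelines involved
        smartphone, _ = Smartphone.objects.get_or_create(sku_number=sku_number)
        CellularNetwork.objects.update_or_create(
            phone_model=smartphone,
            defaults={'speed': speed},
        )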

UPDATE: Resolved.

Based on How to update DjangoItem in Scrapy, I put this in the spider to directly return the Django model instance:

def item_to_model(self, item):
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

    return item.instance
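
Roughly, I call it from the parse callback like this (my own sketch; the selectors are placeholders):

def parse(self, response):
    item = SmartphoneItem()
    item['sku_number'] = response.xpath('//span[@class="sku"]/text()').extract_first()  # placeholder

    phone = self.item_to_model(item)  # back to a Smartphone model instance
    phone.save()

    network = CellularNetworkItem()
    network['phone_model'] = phone    # now a real model, so the FK can be set
    network['speed'] = response.xpath('//span[@class="speed"]/text()').extract_first()
    yield network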

UPDATE 2: Still broken:

Spoke too soon. While this provides a valid Django instance, it is not the instance that is already in the database. Here is the relevant part of djangoitem.py:

@property
def instance(self):
    if self._instance is None:
        modelargs = dict((k, self.get(k)) for k in self._values
                          if k in self._model_fields)
        self._instance = self.django_model(**modelargs)
    return self._instance

I don't fully understand the internals, but as far as I can tell it builds a fresh, unsaved model instance from the item's values rather than fetching the row that already exists in the database.
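
So on top of item_to_model I apparently need something that checks the database for an existing row and only falls back to the fresh instance, along these lines (sketch):

def get_or_create(self, model):
    model_class = type(model)
    created = False
    try:
        # look for a row that already exists with this primary key
        obj = model_class.objects.get(pk=model.pk)
    except model_class.DoesNotExist:
        created = True
        obj = model  # no row yet, keep the fresh unsaved instance
    return (obj, created)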


1 Answer


Question looks old, but anyway. First, you're missing a pipeline that will "pass" the items to Django. Second, for the FK you need an instance of the Django model (Smartphone), not an instance of the Scrapy item. So your code should look something like this:

from myproject.items import SmartphoneItem, CellularNetworkItem  # adjust to your project layout
from myproject.models import Smartphone, CellularNetwork

class SmartphonePipeline(object):

    def process_item(self, item, spider):
        if not isinstance(item, SmartphoneItem):
            return item  # let other item types pass through to the next pipeline
        # get_or_create already saves a newly created row, so no extra save() is needed
        smartphone, created = Smartphone.objects.get_or_create(sku_number=item['sku_number'])
        return item

class CellularNetworkPipeline(object):

    def process_item(self, item, spider):
        if not isinstance(item, CellularNetworkItem):
            return item
        smartphone_item = item['phone_model']
        # resolve the FK target created by SmartphonePipeline
        smartphone = Smartphone.objects.get(sku_number=smartphone_item['sku_number'])
        cellular_network, created = CellularNetwork.objects.get_or_create(phone_model=smartphone)
        cellular_network.speed = item['speed']
        cellular_network.save()
        return item
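
To have these run, enable both pipelines in settings.py, giving the smartphone pipeline a lower order so the FK target exists before the network item is processed (the module path here is just an example):

ITEM_PIPELINES = {
    'myproject.pipelines.SmartphonePipeline': 300,
    'myproject.pipelines.CellularNetworkPipeline': 400,
}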