
This has been asked before, but the answer that always comes up is to use DjangoItem. However, its GitHub page states that it is:

often not a good choice for write-intensive applications (such as a web crawler) ... may not scale well

This is the crux of my problem: I'd like to use and interact with my Django models in the same way I can when I run python manage.py shell and do from myapp.models import Model1, using queries like the ones shown here.
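
For example, in the shell I can do things like this (Model1 and the field names are just placeholders from my own project):

from myapp.models import Model1

# Typical ORM calls available from manage.py shell
Model1.objects.create(name="example")
Model1.objects.filter(name__icontains="example").count()
obj = Model1.objects.get(pk=1)
obj.name = "updated"
obj.save()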

I have tried relative imports and moving my whole scrapy project inside my django app, both to no avail.

Where should I move my scrapy project to for this to work? How can I recreate / use all the methods that are available to me in the shell inside a scrapy pipeline?

Thanks in advance.

Max Smith
  • Were you able to figure this out? – Bipul Jain Jan 30 '17 at 19:55
  • No I have not. It's driving me crazy. I'd really like to avoid dealing with raw SQL. The Django API is great for that! I might look into SQLAlchemy, but I'd rather not learn another library if I don't have to. Do you have any suggestions or possibly an approach I could look into? – Max Smith Feb 02 '17 at 21:24
  • OK I have done this before. Will write down the answer soon. It's a weekend. – Bipul Jain Feb 04 '17 at 18:17
  • Looking forward to it! Thank you. – Max Smith Feb 06 '17 at 23:41

1 Answer


Here I have created a sample project that uses Scrapy inside Django and uses Django models and the ORM in one of the pipelines.

https://github.com/bipul21/scrapy_django

The directory structure starts with your Django project; in this case the project name is django_project. Once inside the base project, you create your Scrapy project, here called scrapy_project.
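
A rough sketch of the layout, based on the description above (file names inside the Scrapy project follow Scrapy's defaults and may differ slightly in the actual repository):

django_project/
    manage.py
    django_project/
        settings.py
    questions/                # Django app with the Questions model
        models.py
    scrapy_project/
        scrapy.cfg
        scrapy_project/
            settings.py       # Django is initialized here
            pipelines.py
            spiders/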

In your Scrapy project settings, add the following lines to set up and initialize Django:

import os
import sys
import django

# Make the Django project root (two directories above this file's directory) importable
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
# Point Django at the project's settings module and initialize it
os.environ['DJANGO_SETTINGS_MODULE'] = 'django_project.settings'

django.setup()
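
For the pipeline below to actually run, it also has to be enabled in the same Scrapy settings file via the standard ITEM_PIPELINES setting; the dotted path shown here assumes the default module layout of the sample project:

ITEM_PIPELINES = {
    'scrapy_project.pipelines.ScrapyProjectPipeline': 300,
}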

In the pipeline I have made a simple query against the Questions model:

from questions.models import Questions

class ScrapyProjectPipeline(object):
    def process_item(self, item, spider):
        # Skip items that have already been stored
        try:
            Questions.objects.get(identifier=item["identifier"])
            print("Question already exists")
            return item
        except Questions.DoesNotExist:
            pass

        # Create and save a new row using the Django ORM
        question = Questions()
        question.identifier = item["identifier"]
        question.title = item["title"]
        question.url = item["url"]
        question.save()
        return item
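
The item processed above just needs to expose the three fields the pipeline reads; a minimal items.py sketch matching that usage (the class name is illustrative, the actual definition is in the repository) could look like:

import scrapy

class QuestionItem(scrapy.Item):
    identifier = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()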

You can check the project for any further details, such as the model schema.
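
For reference, a minimal Questions model matching the fields used in the pipeline could look roughly like this (the field types are an assumption; the real schema is in the repository):

from django.db import models

class Questions(models.Model):
    identifier = models.CharField(max_length=255, unique=True)
    title = models.CharField(max_length=255)
    url = models.URLField()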

Bipul Jain
  • This is exactly what I was looking for! Thank you so much. I can't believe I couldn't find this online. – Max Smith Feb 08 '17 at 21:32
  • So if this works, what's the point of the Scrapy-DjangoItem plugin? This seems better, since I'd need to make queries to the db as well to properly update/save the scraped items – Tjorriemorrie Jan 10 '18 at 03:04
  • @Tjorriemorrie did you find anything about django-items plugin? – cikatomo Oct 27 '20 at 03:30