
Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model?

I've seen this, but I don't really get how to set it up?

imns
  • http://stackoverflow.com/questions/937742/use-django-orm-as-standalone – S.Lott Nov 24 '10 at 22:20
  • possible duplicate of [Use only some parts of Django?](http://stackoverflow.com/questions/302651/use-only-some-parts-of-django) – S.Lott Nov 24 '10 at 22:21
  • That's not really what I am looking for because I am already using django. I don't want to just use the ORM, and I don't want to have to maintain two separate settings files. – imns Nov 24 '10 at 22:36
  • You want to use one part of Django: the ORM. That's a common question. Please search. The Django site referenced in that question has the specific ways to use the ORM separately without extra settings. Please actually read the question, the answers and follow the links. This is a common question. It's been answered. – S.Lott Nov 25 '10 at 03:50
  • Sorry S. Lott, this is not the same question. – bdd Apr 23 '11 at 17:01

8 Answers


If anyone else is having the same problem, this is how I solved it.

I added this to my scrapy settings.py file:

def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)       

    setup_environ(project)

setup_django_env('/path/to/django/project/')

Note: the path above is to your django project folder, not the settings.py file.

Now you will have full access to your django models inside of your scrapy project.
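For instance, once the environment is set up this way, a pipeline can save scraped data straight to a model. A minimal sketch (the `DjangoSavePipeline` name, the `myapp.models.BlogPost` model, and its fields are hypothetical; substitute your own):

```python
# Sketch of a Scrapy pipeline that writes items to a Django model.
# "myapp.models.BlogPost" and the item fields are hypothetical examples.

class DjangoSavePipeline(object):
    def process_item(self, item, spider):
        # Import inside the method so the Django environment configured in
        # settings.py is already set up by the time this runs.
        from myapp.models import BlogPost

        post = BlogPost(title=item['title'], body=item['body'])
        post.save()
        return item
```

Register the pipeline in your scrapy settings via `ITEM_PIPELINES` as usual.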

imns
  • Here's a related answer that includes pipeline.py code: http://stackoverflow.com/questions/7883196/saving-django-model-from-scrapy-project – Lionel Nov 27 '11 at 16:44
  • Just a small note. With the new project layout in Django 1.4, the path should be `setup_django_env('/path/to/django/project/project/')` – samwize Jul 26 '12 at 17:01
  • This solution was working great for me until I tried to deploy using scrapyd. When the scrapyd automatically builds the egg, it seems to be missing the Django code in package. I get this: Deploying my_scraper-1354463004 to http://localhost:6800/addversion.json Server response (200): {"status": "error", "message": "ImportError: No module named settings"} -- any advice how to handle this? It would seem that if I could get the Django code into the egg I'd be okay, but I'm not really clear on how to do that. – Mark Chackerian Dec 02 '12 at 15:49
  • I'm trying to use your solution but it's giving me an `Import Error`. Any chance you'd be willing to look at my [question] (http://stackoverflow.com/questions/14686223/scrapy-project-cant-find-django-core-management) and offer advice? – GChorn Feb 05 '13 at 04:29
  • Can you tell me how to solve this error? I followed the steps mentioned above: `sudo scrapy deploy default -p eScraper` gives `Building egg of eScraper-1370604165 'build/scripts-2.7' does not exist -- can't clean it zip_safe flag not set; analyzing archive contents... Deploying eScraper-1370604165 to http://localhost:6800/addversion.json Server response (200): {"status": "error", "message": "ImportError: Error loading object 'eScraper.pipelines.EscraperPipeline': No module named eScraperInterfaceApp.models"}` – Vaibhav Jain Jun 07 '13 at 11:25
  • In current Django versions `from django.core.management import setup_environ` will not work... how to do this now? – Zubair Alam May 29 '14 at 08:46

The opposite solution (setup scrapy in a django management command):

# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py 

from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

and in django's settings.py:

import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_project.settings'

Then, instead of `scrapy foo`, run `./manage.py scrapy foo`.

UPD: fixed the code to bypass django's options parsing.

Mikhail Korobov
  • @Mikhail: I'm actually trying your code snippet with Django 1.4 and Scrapy 0.14.3. Unfortunately, it does not work. For instance, if I want to execute `python manage.py scrapy list` inside the Django project folder, I always get `ImportError: No module named cmdline`. However, the module named `cmdline` does exist and the site-packages directory of my Python installation is in the PYTHONPATH as well. What am I doing wrong? Thanks in advance! – pemistahl May 10 '12 at 22:39
  • do you have scrapy in your PYTHONPATH? – Mikhail Korobov May 11 '12 at 01:43
  • Okay, I'm stupid. I solved the problem. First I thought that the line `from __future__ import absolute_import` wouldn't be necessary in Python 2.7. That's why I commented it out, but it only works with this line. Generally, I have some problems with understanding absolute and relative paths in Python. I definitely should read into this a bit more. Anyway, thanks for your help! – pemistahl May 11 '12 at 10:07
  • @Mikhail: I just realized that I cannot execute Scrapy's command line options such as `-o scraped_data.json -t json`. I know how to add options to commands in general, but how to link them to Scrapy's counterparts? – pemistahl May 12 '12 at 12:10
  • @Peter: please try the updated example. It should pass options to scrapy and not try to handle them as django's options. – Mikhail Korobov May 13 '12 at 19:27
  • @Mikhail: Awesome! I never thought that this would be so easy. I don't know why it works but it works. Thank you so much! Meanwhile, I have found another solution in [my own thread](http://stackoverflow.com/questions/10564389/django-custom-management-command-running-scrapy-how-to-include-scrapys-options) but yours is definitely the way to go. :-) – pemistahl May 13 '12 at 20:03
  • @Mikhail: This is working for me great from the command line, thanks, but I can't run the management command from inside Django. If I try >>> from django.core import management >>> management.call_command('command_name') I get AttributeError: 'Command' object has no attribute '_argv' -- Any suggestions? – Mark Chackerian Jul 13 '12 at 12:15
  • I think you can instantiate the command and run it using "run_from_argv" method: `myapp.management.commands.scrapy.Command().run_from_argv(['', 'crawl', 'dmoz'])` – Mikhail Korobov Jul 13 '12 at 12:37
  • Thanks -- that was basically it. I got this to work: `myapp.management.commands.scrapy.Command().run_from_argv(['scrapy','','crawl','dmoz'])` – Mark Chackerian Jul 17 '12 at 21:28
  • After going down this very deep rabbit hole I ultimately abandoned this approach for a number of reasons, such as 1) the inability to restart twisted, which means that only the first command would work, so trying to trigger from multiple user initiated actions is impossible. 2) having to re-write a whole bunch of scrapy to get around the problem that twisted assumes that it is started from the main thread. So I recommend that you use scrapyd instead and call as a webservice. – Mark Chackerian Aug 02 '12 at 18:11
  • @MikhailKorobov I'm currently dealing with deploying my spiders to a `scrapyd` server. However, when I execute `python manage.py scrapy server`, I get `scrapy.exceptions.NotConfigured: Unable to find scrapy.cfg file to infer project data dir`. How to resolve this? – pemistahl Aug 31 '12 at 18:25
  • @MikhailKorobov You can find a more detailed explanation of my problem [in this thread](http://stackoverflow.com/questions/12221937/cannot-import-either-scrapys-settings-module-or-its-scrapy-cfg) – pemistahl Aug 31 '12 at 20:20
  • You don't need to bypass option parsing. You just need a POSIX style delimiter. See [my answer](http://stackoverflow.com/questions/10564389/#13039421) to Peter Stahl's question. – Aryeh Leib Taurog Oct 23 '12 at 21:31
  • @MikhailKorobov How do you setup your scrapy project directory inside the django directory folders? Thanks! – pyramidface Dec 04 '14 at 21:18

Set the DJANGO_SETTINGS_MODULE environment variable in your scrapy project's settings.py:

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'your_django_project.settings'

Now you can use DjangoItem in your scrapy project.

Edit:
You have to make sure that the your_django_project project's settings.py is reachable via PYTHONPATH.
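DjangoItem's core idea is that the item's fields mirror the model's fields, and its `save()` hands the collected values to a model instance. Roughly, as a dependency-free sketch of that idea (not Scrapy's actual implementation):

```python
# Dependency-free sketch of the idea behind DjangoItem: the item behaves
# like a dict of field values, and save() copies them onto an instance of
# the configured model class.

class DjangoItemSketch(dict):
    django_model = None  # set in subclasses to a Django model class

    def save(self):
        # Pass the collected field values to the model constructor,
        # persist the instance, and return it.
        instance = self.django_model(**self)
        instance.save()
        return instance
```

With the real DjangoItem you just set `django_model` on your item class and call `item.save()` from a pipeline.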

Woltan
Jet Guo

For Django 1.4, the project layout has changed. Instead of /myproject/settings.py, the settings module is in /myproject/myproject/settings.py.

I also added path's parent directory (/myproject) to sys.path to make it work correctly.

def setup_django_env(path):
    import imp, os, sys
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)       

    setup_environ(project)

    # Add path's parent directory to sys.path
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

setup_django_env('/path/to/django/myproject/myproject/')
samwize
  • Note that the usage of `setup_environ` is [deprecated](https://docs.djangoproject.com/en/1.4/releases/1.4/#django-core-management-setup-environ) starting from version 1.4. – pemistahl Sep 03 '12 at 08:56

Check out django-dynamic-scraper, it integrates a Scrapy spider manager into a Django site.

https://github.com/holgerd77/django-dynamic-scraper

Sectio Aurea

Why not create an __init__.py file in the scrapy project folder and hook it up in INSTALLED_APPS? Worked for me. I was able to simply use:

pipelines.py

from my_app.models import MyModel

Hope that helps.
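Concretely, assuming the Scrapy project package (the folder you added the __init__.py to) is named, say, `scraper` and lives inside the Django project (both names here are hypothetical), the settings change is just:

```python
# Django settings.py -- "scraper" is a hypothetical name for the Scrapy
# project folder; the __init__.py you created makes it importable as a
# regular package, so Django apps and the scrapy code can import each other.
INSTALLED_APPS = [
    # ... your other apps ...
    'my_app',
    'scraper',
]
```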

Özer

setup_environ is deprecated. For newer versions of Django (django.setup() is available as of Django 1.7), you may need to do the following in scrapy's settings file:

def setup_django_env():
    import sys, os, django

    # Make the Django project importable, then point Django at its settings.
    sys.path.append('/path/to/django/myapp')
    os.environ['DJANGO_SETTINGS_MODULE'] = 'myapp.settings'

    # django.setup() must run after DJANGO_SETTINGS_MODULE is set.
    django.setup()

setup_django_env()
Brayoni

Minor update to the management command above to solve a KeyError (Python 3 / Django 1.10 / Scrapy 1.2.0):

from django.core.management.base import BaseCommand

class Command(BaseCommand):    
    help = 'Scrapy commands. Accessible from: "Django manage.py". '

    def __init__(self, stdout=None, stderr=None, no_color=False):
        # Forward the output options to BaseCommand instead of discarding
        # them; Django 1.10's execute() expects these keys to be present.
        super().__init__(stdout=stdout, stderr=stderr, no_color=no_color)

        # Holds the raw CLI argv so handle() can forward it to scrapy.
        self._argv = None

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute(stdout=None, stderr=None, no_color=False)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

The SCRAPY_SETTINGS_MODULE declaration is still required in django's settings.py:

import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_project.settings')
Siggy