113

I am working on Scrapy 0.20 with Python 2.7. I found that PyCharm has a good Python debugger, and I want to use it to test my Scrapy spiders. Does anyone know how to do that?

What I have tried

Actually, I tried to run the spider as a script, so I built that script. Then I tried to add my Scrapy project to PyCharm as a module, like this:
File -> Settings -> Project Structure -> Add content root.

But I don't know what else I have to do.

Zephyr
William Kinaan

11 Answers

201

The scrapy command is a Python script, which means you can start it from inside PyCharm.

When you examine the scrapy binary (`which scrapy`) you will notice that it is actually a Python script:

#!/usr/bin/python

from scrapy.cmdline import execute
execute()

This means that a command like `scrapy crawl IcecatCrawler` can also be executed like this: `python /Library/Python/2.7/site-packages/scrapy/cmdline.py crawl IcecatCrawler`

Try to find the scrapy.cmdline module. In my case it was located at /Library/Python/2.7/site-packages/scrapy/cmdline.py

Create a Run/Debug configuration inside PyCharm with that script as the script to run. Fill the script parameters with the scrapy command and spider, in this case `crawl IcecatCrawler`.

Like this: PyCharm Run/Debug Configuration

Put your breakpoints anywhere in your crawling code and it should work™.
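
Under the hood, execute() simply parses sys.argv, so the script parameters you enter in the Run/Debug configuration are equivalent to calling it directly. A minimal sketch, reusing the example spider name from above:

from scrapy.cmdline import execute

# Equivalent to running "scrapy crawl IcecatCrawler" on the command line;
# when called without an argument list, execute() falls back to sys.argv.
execute(["scrapy", "crawl", "IcecatCrawler"])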

Pullie
  • SyntaxError: Non-ASCII character '\xf3' in file /Library/python/2.7/site-packages/scrapy/cmdline.pyc on line 1, but no encoding declared – Aymon Fournier Dec 06 '14 at 03:40
  • @AymonFournier: Different problem, not related to the original question. See: http://stackoverflow.com/questions/10589620/syntaxerror-non-ascii-character-xa3-in-file-when-function-returns-£ – Pullie Dec 06 '14 at 09:08
  • Great solution! I also tried using the scrapy binary itself (mostly located at /usr/bin/scrapy) as the script, with the same parameters or any other scrapy command you want to debug, and it worked just perfectly. Make sure the working directory points to your scrapy project root, where scrapy.cfg is located. – Nour Wolf Jan 18 '15 at 02:31
  • @AymonFournier It seems you are trying to run a .pyc file. Run the corresponding .py file instead (scrapy/cmdline.py). – Artur Gaspar May 11 '15 at 22:05
  • If I'm doing that, my settings module is not found: `ImportError: No module named settings`. I have checked that the working directory is the project directory. It's used within a Django project. Has anyone else stumbled upon this problem? – suntoch Jan 21 '16 at 22:24
  • Using this method, the configuration for scrapy seems to be ignored. I'm not sure why. – javamonkey79 Sep 23 '16 at 01:06
  • Don't forget to configure the `Working directory`, otherwise it will fail with `no active project, Unknown command: crawl, Use "scrapy" to see available commands, Process finished with exit code 2` – crifan Jan 09 '18 at 12:59
  • It says: `from scrapy.http.headers import Headers ImportError: cannot import name 'Headers' from partially initialized module 'scrapy.http.headers'`, most likely due to a circular import (Python 3.8). – Amrit May 09 '21 at 15:09
122

You just need to do this.

Create a Python file in the crawler folder of your project. I used main.py.

  • Project
    • Crawler
      • Crawler
        • Spiders
        • ...
      • main.py
      • scrapy.cfg

Inside your main.py, put the code below.

from scrapy import cmdline

# Launch the "scrapy crawl" command in-process; replace "spider" with
# the name of your spider so the debugger can attach to it.
cmdline.execute("scrapy crawl spider".split())

And you need to create a "Run Configuration" to run your main.py. Set its working directory to the folder containing scrapy.cfg so that Scrapy can find your project settings.

With this in place, if you put a breakpoint in your code, execution will stop there.

Rodrigo
  • You might want to configure multiple executions for different spiders, so accept the spider name as an argument of your run config. Then: `import sys; spider = sys.argv[1]; cmdline.execute("scrapy crawl {}".format(spider).split())` – miguelfg Oct 01 '17 at 20:47
  • @miguelfg, can you elaborate on how to pass a spider name as an argument in run config without doing so manually each time you run the project? – NFB Jan 28 '18 at 14:38
48

As of 2018.1 this became a lot easier. You can now select Module name in your project's Run/Debug Configuration. Set this to scrapy.cmdline and the Working directory to the root dir of the scrapy project (the one with settings.py in it).

Like so:

PyCharm Scrapy debug configuration

Now you can add breakpoints to debug your code.
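
Selecting a module name works because PyCharm then launches it with python -m. Crudely, the configuration above amounts to this command line (the spider name here is a hypothetical example):

python -m scrapy.cmdline crawl myspider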

Rutger de Knijf
12

I am running scrapy in a virtualenv with Python 3.5.0 and setting the "script" parameter to /path_to_project_env/env/bin/scrapy solved the issue for me.

rioted
  • I'm surprised this works; I thought scrapy didn't work with Python 3 – user1592380 May 31 '16 at 14:06
  • Thanks, this worked with Python 3.5 and virtualenv: "script" as @rioted said, and setting "working directory" to `project/crawler/crawler`, i.e., the directory holding `__init__.py`. – effel Dec 07 '16 at 15:52
6

IntelliJ IDEA also works.

Create main.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy import cmdline

def main(name):
    # Run any "scrapy ..." command line in-process so breakpoints are hit.
    if name:
        cmdline.execute(name.split())

if __name__ == '__main__':
    print('[*] beginning main thread')
    name = "scrapy crawl stack"
    # name = "scrapy crawl spa"
    main(name)
    # Note: cmdline.execute() ends with sys.exit(), so this line is
    # normally not reached.
    print('[*] main thread exited')


LuciferJack
3

I am also using PyCharm, but I am not using its built-in debugging features.

For debugging I am using ipdb. I set up a keyboard shortcut to insert import ipdb; ipdb.set_trace() on any line where I want a breakpoint.

Then I can type `n` to execute the next statement, `s` to step into a function, type any object name to see its value, alter the execution environment, type `c` to continue execution...

This is very flexible, and it works in environments other than PyCharm, where you don't control the execution environment.

Just run pip install ipdb in your virtual environment and place import ipdb; ipdb.set_trace() on the line where you want execution to pause.
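
As an illustration, here is a minimal sketch of dropping such a breakpoint into a spider callback (the spider name and start URL are made-up placeholders):

import scrapy

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider; name and start URL are placeholders.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Execution pauses at the ipdb prompt in the console running scrapy.
        import ipdb; ipdb.set_trace()
        self.log("parsing %s" % response.url)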

UPDATE

You can also pip install pdbpp and use the standard import pdb; pdb.set_trace() instead of ipdb. PDB++ is nicer in my opinion.

warvariuc
3

To add a bit to the accepted answer: after almost an hour I found that I had to select the correct Run Configuration from the dropdown list (near the center of the icon toolbar) and then click the Debug button to get it to work. Hope this helps!

taylor
3

According to the documentation (https://doc.scrapy.org/en/latest/topics/practices.html), you can run your spider from a script:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
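
Save this as a standalone file (for example, a hypothetical debug_run.py next to scrapy.cfg) and launch it with PyCharm's debugger; since the spider runs in the same process, breakpoints inside MySpider will be hit.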
Reuse3733
2

I use this simple script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() loads your project's settings.py, so pipelines
# and middlewares behave just as they do under "scrapy crawl".
process = CrawlerProcess(get_project_settings())

process.crawl('your_spider_name')
process.start()
gangabass
  • I use something similar to this, called `runner.py`. The reason this is important is that it intentionally loads the project settings file. You must do this if you are trying to load pipeline(s). – Rob Mar 21 '21 at 22:05
2

Might be a bit late, but maybe it helps somebody:

Since recent PyCharm versions it's actually pretty straightforward: you can call Scrapy directly; see the attached picture of the run configuration (from the Scrapy tutorial).

Tested with PyCharm 2022.1.4.

(screenshot: PyCharm Run/Debug configuration for the Scrapy tutorial)

greg
0

Extending @Rodrigo's answer, I added this script so that the spider name can be set from the run configuration instead of being changed in the string.

import sys
from scrapy import cmdline

# The spider name arrives as the first script parameter of the
# run configuration.
cmdline.execute(f"scrapy crawl {sys.argv[1]}".split())
Muhammad Haseeb