23

I have Visual Studio Code on a Windows machine, on which I am building a new Scrapy crawler. The crawler works fine, but I want to debug the code, so I am adding this to my launch.json file:

{
    "name": "Scrapy with Integrated Terminal/Console",
    "type": "python",
    "request": "launch",
    "stopOnEntry": true,
    "pythonPath": "${config:python.pythonPath}",
    "program": "C:/Users/neo/.virtualenvs/Gers-Crawler-77pVkqzP/Scripts/scrapy.exe",
    "cwd": "${workspaceRoot}",
    "args": [
        "crawl",
        "amazon",
        "-o",
        "amazon.json"
    ],
    "console": "integratedTerminal",
    "env": {},
    "envFile": "${workspaceRoot}/.env",
    "debugOptions": [
        "RedirectOutput"
    ]
}

But I am unable to hit any breakpoints. PS: I took the JSON script from here: http://www.stevetrefethen.com/blog/debugging-a-python-scrapy-project-in-vscode

naqushab
  • Possible duplicate of [How to use PyCharm to debug Scrapy projects](https://stackoverflow.com/questions/21788939/how-to-use-pycharm-to-debug-scrapy-projects) – Lore Mar 10 '18 at 09:10

7 Answers

44

In order to execute the typical `scrapy runspider <PYTHON_FILE>` command, you must set the following configuration in your launch.json:

{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}

Set the breakpoints wherever you want and then debug.
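If your project uses named spiders and you prefer the equivalent of `scrapy crawl <spider>` (as in the question), a similar module-based configuration should work — this is a sketch; `amazon` and `amazon.json` are just the placeholders from the question, so substitute your own spider name and output file:

```json
{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Scrapy Crawl",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "cwd": "${workspaceFolder}",
            "args": [
                "crawl",
                "amazon",
                "-o",
                "amazon.json"
            ],
            "console": "integratedTerminal"
        }
    ]
}
```

The key difference from the question's config is `"module": "scrapy"` instead of `"program": ".../scrapy.exe"` — launching the module lets the debugger attach to the Python process, whereas launching the .exe does not.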

cpinamtz
    This should be the accepted answer. Adding a scrapy-specific configuration in the working-project's `launch.json` is a good practice, easy to implement and doesn't require an additional script to be created. – Andreas L. Dec 01 '20 at 16:56
  • @cpinamtz This doesn't seem to work for me for whatever reason. I use a very similar set up, the program runs, but doesn't stop at the break points. – Asif Dec 05 '22 at 01:56
  • @Asif in case you're using multiple spiders, check you're executing the one you expect to debug – cpinamtz Dec 05 '22 at 13:18
22
  1. Inside your scrapy project folder create a runner.py module with the following:

    import os
    from scrapy.cmdline import execute
    
    os.chdir(os.path.dirname(os.path.realpath(__file__)))
    
    try:
        execute(
            [
                'scrapy',
                'crawl',
                'SPIDER NAME',
                '-o',
                'out.json',
            ]
        )
    except SystemExit:
        pass
    
  2. Place a breakpoint in the line you wish to debug

  3. Run runner.py with vscode debugger
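If you prefer an explicit launch configuration over the generic "Python: Current File" entry, something like this should point the debugger at the runner (assuming runner.py sits in the workspace root — adjust the path otherwise):

```json
{
    "name": "Python: Scrapy Runner",
    "type": "python",
    "request": "launch",
    "program": "${workspaceFolder}/runner.py",
    "console": "integratedTerminal"
}
```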

fmagno
9

Configure your launch.json file like this:

"version": "0.2.0",
"configurations": [
    {
        "name": "Crawl with scrapy",
        "type": "python",
        "request": "launch",
        "module": "scrapy",
        "cwd": "${fileDirname}",
        "args": [
            "crawl",
            "<SPIDER NAME>"
        ],
        "console": "internalConsole"
    }
]

Click on the tab in VSCode corresponding to your spider, then launch a debug session corresponding to the json file.

Manu NALEPA
5

You could also try with

{
  "configurations": [
    {
        "name": "Python: Scrapy",
        "type": "python",
        "request": "launch",
        "module": "scrapy",
        "cwd": "${fileDirname}",
        "args": [
            "crawl",
            "${fileBasenameNoExtension}",
            "--loglevel=ERROR"
        ],
        "console": "integratedTerminal",
        "justMyCode": false
    }
  ]
}

Note that the file's basename must match the spider's `name` attribute, since `${fileBasenameNoExtension}` is what gets passed as the spider name.

The `--loglevel=ERROR` flag makes the output less verbose ;)

Maximo Silva
  • To make `crawl` work with `${fileBasenameNoExtension}`, make sure that the spider class's `name` attribute has the same value as the script file's basename. – omegastripes May 22 '22 at 21:50
3

I got it working. The simplest way is to create a runner script, runner.py:

import scrapy
from scrapy.crawler import CrawlerProcess

from g4gscraper.spiders.g4gcrawler import G4GSpider

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})

process.crawl(G4GSpider)
process.start() # the script will block here until the crawling is finished

Then I added breakpoints in the spider and launched the debugger on this file. Reference: https://doc.scrapy.org/en/latest/topics/practices.html

naqushab
2

I applied @fmango's code and improved it.

  1. Instead of writing a separate runner file, just paste these lines at the end of the spider module.

  2. Run the Python debugger. That's all.

if __name__ == '__main__':
    import os
    from scrapy.cmdline import execute

    os.chdir(os.path.dirname(os.path.realpath(__file__)))

    SPIDER_NAME = MySpider.name
    try:
        execute(
            [
                'scrapy',
                'crawl',
                SPIDER_NAME,
                '-s',
                'FEED_EXPORT_ENCODING=utf-8',
            ]
        )
    except SystemExit:
        pass
gisman
1

You don't need to modify launch.json; the default "Python: Current File (Integrated Terminal)" configuration works perfectly. For a Python 3 project, remember to place the runner.py file at the same level as the scrapy.cfg file (i.e., the project root).

The runner.py code is the same as @naqushab's above. Note the process.crawl(className) call, where className is the spider class you want to set breakpoints in.
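For reference, the layout would look something like this (project and spider names here are placeholders, not from the question):

```text
myproject/
├── scrapy.cfg
├── runner.py          <- runner goes here, next to scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── myspider.py
```

Placing the runner next to scrapy.cfg matters because Scrapy locates the project settings by searching for scrapy.cfg from the current working directory.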

Peter.Wang