23

I have Visual Studio Code on a Windows machine, on which I am building a new Scrapy crawler. The crawler works fine, but I want to debug the code, so I am adding this to my launch.json file:

{
    "name": "Scrapy with Integrated Terminal/Console",
    "type": "python",
    "request": "launch",
    "stopOnEntry": true,
    "pythonPath": "${config:python.pythonPath}",
    "program": "C:/Users/neo/.virtualenvs/Gers-Crawler-77pVkqzP/Scripts/scrapy.exe",
    "cwd": "${workspaceRoot}",
    "args": [
        "crawl",
        "amazon",
        "-o",
        "amazon.json"
    ],
    "console": "integratedTerminal",
    "env": {},
    "envFile": "${workspaceRoot}/.env",
    "debugOptions": [
        "RedirectOutput"
    ]
}

But I am unable to hit any breakpoints. PS: I took the JSON script from here: http://www.stevetrefethen.com/blog/debugging-a-python-scrapy-project-in-vscode

naqushab
  • Possible duplicate of [How to use PyCharm to debug Scrapy projects](https://stackoverflow.com/questions/21788939/how-to-use-pycharm-to-debug-scrapy-projects) – Lore Mar 10 '18 at 09:10

7 Answers

44

In order to execute the typical `scrapy runspider <PYTHON_FILE>` command, you must set the following configuration in your launch.json:

{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}

Set the breakpoints wherever you want and then debug.
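If your project uses named spiders and you prefer the equivalent of `scrapy crawl <spider>` (as in the question), a similar module-based configuration should work — this is a sketch; `amazon` and `amazon.json` are just the placeholders from the question, so substitute your own spider name and output file:

```json
{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Scrapy Crawl",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "cwd": "${workspaceFolder}",
            "args": [
                "crawl",
                "amazon",
                "-o",
                "amazon.json"
            ],
            "console": "integratedTerminal"
        }
    ]
}
```

The key difference from the question's config is `"module": "scrapy"` instead of `"program": ".../scrapy.exe"` — launching the module lets the debugger attach to the Python process, whereas launching the .exe does not.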

cpinamtz
    This should be the accepted answer. Adding a scrapy-specific configuration in the working-project's `launch.json` is a good practice, easy to implement and doesn't require an additional script to be created. – Andreas L. Dec 01 '20 at 16:56
  • @cpinamtz This doesn't seem to work for me for whatever reason. I use a very similar set up, the program runs, but doesn't stop at the break points. – Asif Dec 05 '22 at 01:56
  • @Asif in case you're using multiple spiders, check you're executing the one you expect to debug – cpinamtz Dec 05 '22 at 13:18
22
  1. Inside your scrapy project folder create a runner.py module with the following:

    import os
    from scrapy.cmdline import execute
    
    os.chdir(os.path.dirname(os.path.realpath(__file__)))
    
    try:
        execute(
            [
                'scrapy',
                'crawl',
                'SPIDER NAME',
                '-o',
                'out.json',
            ]
        )
    except SystemExit:
        pass
    
  2. Place a breakpoint in the line you wish to debug

  3. Run runner.py with vscode debugger
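If you prefer an explicit launch configuration over the generic "Python: Current File" entry, something like this should point the debugger at the runner (assuming runner.py sits in the workspace root — adjust the path otherwise):

```json
{
    "name": "Python: Scrapy Runner",
    "type": "python",
    "request": "launch",
    "program": "${workspaceFolder}/runner.py",
    "console": "integratedTerminal"
}
```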

fmagno
9

Configure your launch.json file like this:

"version": "0.2.0",
"configurations": [
    {
        "name": "Crawl with scrapy",
        "type": "python",
        "request": "launch",
        "module": "scrapy",
        "cwd": "${fileDirname}",
        "args": [
            "crawl",
            "<SPIDER NAME>"
        ],
        "console": "internalConsole"
    }
]

Click on the tab in VSCode corresponding to your spider, then launch a debug session corresponding to the json file.

Manu NALEPA
5

You could also try with

{
  "configurations": [
    {
        "name": "Python: Scrapy",
        "type": "python",
        "request": "launch",
        "module": "scrapy",
        "cwd": "${fileDirname}",
        "args": [
            "crawl",
            "${fileBasenameNoExtension}",
            "--loglevel=ERROR"
        ],
        "console": "integratedTerminal",
        "justMyCode": false
    }
  ]
}

Note that the file's basename must match the spider's `name` attribute, since `${fileBasenameNoExtension}` is what gets passed as the spider name.

The `--loglevel=ERROR` flag makes the output less verbose ;)

Maximo Silva
  • To make `crawl` work with `${fileBasenameNoExtension}`, make sure that the spider class's `name` attribute has the same value as the script file's basename. – omegastripes May 22 '22 at 21:50
3

I got it working. The simplest way is to create a runner script, runner.py:

import scrapy
from scrapy.crawler import CrawlerProcess

from g4gscraper.spiders.g4gcrawler import G4GSpider

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})

process.crawl(G4GSpider)
process.start() # the script will block here until the crawling is finished

Then I added breakpoints in the spider and launched the debugger on this file. Reference: https://doc.scrapy.org/en/latest/topics/practices.html

naqushab
2

I applied @fmango's code and improved it.

  1. Instead of writing a separate runner file, just paste these lines at the end of the spider module.

  2. Run the Python debugger. That's all.

if __name__ == '__main__':
    import os
    from scrapy.cmdline import execute

    os.chdir(os.path.dirname(os.path.realpath(__file__)))

    SPIDER_NAME = MySpider.name
    try:
        execute(
            [
                'scrapy',
                'crawl',
                SPIDER_NAME,
                '-s',
                'FEED_EXPORT_ENCODING=utf-8',
            ]
        )
    except SystemExit:
        pass
gisman
1

You don't need to modify launch.json; the default "Python: Current File (Integrated Terminal)" configuration works perfectly. For a Python 3 project, remember to place the runner.py file at the same level as the scrapy.cfg file (i.e., the project root).

The runner.py code is the same as @naqushab's above. Note the process.crawl(className) call, where className is the spider class you want to set breakpoints in.
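For reference, the layout would look something like this (project and spider names here are placeholders, not from the question):

```text
myproject/
├── scrapy.cfg
├── runner.py          <- runner goes here, next to scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── myspider.py
```

Placing the runner next to scrapy.cfg matters because Scrapy locates the project settings by searching for scrapy.cfg from the current working directory.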

Peter.Wang