
Before Scrapy 1.0, I could've run the Scrapy Shell against a local file quite simply:

$ scrapy shell index.html

After upgrading to 1.0.3, it started to throw an error:

$ scrapy shell index.html
2015-10-12 15:32:59 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-12 15:32:59 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-10-12 15:32:59 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
Traceback (most recent call last):
  File "/Users/user/.virtualenvs/so/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/commands/shell.py", line 50, in run
    spidercls = spidercls_for_request(spider_loader, Request(url),
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: index.html 

Is this behavior intended or is this a bug in Scrapy Shell?


As a workaround, I can use an absolute path to the file in a "file" URL scheme:

$ scrapy shell file:///absolute/path/to/index.html

which is, obviously, much less convenient.
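If you don't want to type the absolute path by hand, Python's standard library can build the `file://` URL for you. This is just a convenience sketch, not part of Scrapy; `index.html` is an example file name:

```python
from pathlib import Path

def file_url(path: str) -> str:
    """Turn a (possibly relative) path into an absolute file:// URL.

    Path.resolve() makes the path absolute; as_uri() adds the
    file:// scheme that scrapy.Request requires.
    """
    return Path(path).resolve().as_uri()

print(file_url("index.html"))
```

Passing the result to `scrapy shell` is equivalent to the `file://` workaround above, without typing the absolute path yourself.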

alecxe
    Scrapy is already on track to treat `scrapy shell index.html` as `scrapy shell http://index.html/`. https://github.com/scrapy/scrapy/pull/1498 For your convenience you can change your workaround to `scrapy shell file://$PWD/index.html` on *nix systems. – digenishjkl Oct 14 '15 at 08:46
  • @digenishjkl thanks for the link to the changeset and the shortcut for the nix systems. I guess I should create an issue at scrapy github issue tracker so that we can get that "convenience" back. – alecxe Oct 15 '15 at 23:46
  • Okay, created an issue in the Scrapy github issue tracker: https://github.com/scrapy/scrapy/issues/1550. – alecxe Oct 19 '15 at 20:36

3 Answers


Update: for Scrapy >=1.1, this is a built-in feature, you can do:

scrapy shell file:///path/to/file.html
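The behavior added in 1.1 amounts to guessing a scheme for the shell's argument. A rough stdlib-only sketch of the idea (this is an illustration, not Scrapy's actual implementation) could be:

```python
import re
from pathlib import Path

def guess_url(arg: str) -> str:
    """Illustrative scheme guessing: keep explicit schemes as-is,
    map path-like arguments to file:// URLs, and default everything
    else to http://. Not Scrapy's actual code."""
    if re.match(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://", arg):
        return arg  # already has a scheme
    if arg.startswith(("/", "./", "../")):
        return Path(arg).resolve().as_uri()
    return "http://" + arg
```

This also suggests why `scrapy shell index.html` can be treated as an HTTP URL while `scrapy shell ./index.html` opens the local file, as the later answers note.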

Old answer:

As per the discussion in Running scrapy shell against a local file, the relevant change was introduced by this commit. A pull request was created to make the Scrapy shell open local files again, and it is planned to be part of Scrapy 1.1.

alecxe
    With Scrapy 1.5.0, I did `scrapy shell file:///path/to/file.html`. Also, I can put the same `file:///path/to/file.html` in the `start_urls` – Shadi Jul 05 '18 at 10:41

For Scrapy==2.5.1, if the file is in the same directory, you can run the Scrapy Shell against it by putting "./" before the file name:

scrapy shell ./file.html
dshefman

With the following configuration:

  • MacOS X
  • Scrapy 1.6.0

what worked for me was `scrapy shell ./index.html`, with `index.html` in the root folder of the Scrapy-generated project.

Florent Roques
  • Exactly, as easy as it sounds. Scrapy 2.4.1 gives a nice `DEBUG: Crawled (200) (referer: None)`. – Kulbi Jul 31 '22 at 21:03