
I found that my problem is very similar to scrapyd deploy shows 0 spiders. I tried the accepted answer there several times, but it doesn't work for me, so I've come here for help.

The project directory is timediff_crawler, and tree view of the directory is:

timediff_crawler/
├── scrapy.cfg
├── scrapyd-deploy
├── timediff_crawler
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── spiders
│   │   ├── __init__.py
│   │   ├── prod
│   │   │   ├── __init__.py
│   │   │   ├── job
│   │   │   │   ├── __init__.py
│   │   │   │   ├── zhuopin.py
│   │   │   ├── rent
│   │   │   │   ├── australia_rent.py
│   │   │   │   ├── canada_rent.py
│   │   │   │   ├── germany_rent.py
│   │   │   │   ├── __init__.py
│   │   │   │   ├── korea_rent.py
│   │   │   │   ├── singapore_rent.py
...

1.1 Start scrapyd (this works fine)

(crawl_env)web@ha-2:/opt/crawler$ scrapyd
2015-11-11 15:00:37+0800 [-] Log opened.
2015-11-11 15:00:37+0800 [-] twistd 15.4.0 (/opt/crawler/crawl_env/bin/python 2.7.6) starting up.
2015-11-11 15:00:37+0800 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2015-11-11 15:00:37+0800 [-] Site starting on 6800
...

1.2 Edit scrapy.cfg

[settings]
default = timediff_crawler.settings

[deploy:ha2-crawl]
url = http://localhost:6800/
project = timediff_crawler

1.3 Deploy the project

(crawl_env)web@ha-2:/opt/crawler/timediff_crawler$ ./scrapyd-deploy -l
ha2-crawl            http://localhost:6800/

(crawl_env)web@ha-2:/opt/crawler/timediff_crawler$ ./scrapyd-deploy ha2-crawl -p timediff_crawler
Packing version 1447229952
Deploying to project "timediff_crawler" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "timediff_crawler", "version": "1447229952", "spiders": 0, "node_name": "ha-2"}

1.4 The problem

The response shows the number of spiders as 0, but I actually have about 10 spiders.
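One way to double-check what scrapyd actually registered is its standard listspiders.json endpoint; here is a minimal sketch in Python 2 (matching the 2.7.6 interpreter shown in the scrapyd log above):

import json
import urllib2

# ask scrapyd which spiders it found for the deployed project
url = 'http://localhost:6800/listspiders.json?project=timediff_crawler'
print json.load(urllib2.urlopen(url))
# a healthy deploy would return {"status": "ok", "spiders": [...]} with ~10 names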

I followed the advice in that post (scrapyd deploy shows 0 spiders): I deleted all projects, versions, and generated files (including build/, eggs/, project.egg-info and setup.py, roughly as sketched below) and tried to deploy again, but it doesn't work; the number of spiders is always 0.
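For reference, the cleanup amounted to roughly the following (a sketch in Python; it assumes the generated artifacts sit in the project root):

import os
import shutil

# remove artifacts left behind by previous scrapyd-deploy runs
for artifact in ('build', 'project.egg-info', 'eggs'):
    shutil.rmtree(artifact, ignore_errors=True)
if os.path.exists('setup.py'):
    os.remove('setup.py')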

I validated the egg file, and the output suggests it is OK:

(crawl_env)web@ha-2:/opt/crawler/timediff_crawler/eggs/timediff_crawler$ unzip -t 1447229952.egg 
Archive:  1447229952.egg

testing: timediff_crawler/pipelines.py   OK
testing: timediff_crawler/__init__.py   OK
testing: timediff_crawler/items.py   OK
testing: timediff_crawler/spiders/prod/job/zhuopin.py   OK
testing: timediff_crawler/spiders/prod/rent/singapore_rent.py   OK
testing: timediff_crawler/spiders/prod/rent/australia_rent.py   OK
...

So I don't know what's going wrong. Please help, and thanks in advance!

  • can you share your settings file? create a pastebin – eLRuLL Nov 11 '15 at 11:10
  • @eLRuLL, here: [settings.py](http://pastebin.com/GBnniUCg) – Michael Nov 11 '15 at 12:05
  • try changing your `SPIDER_MODULES` to the path where your spiders are, for example `['timediff_crawler.spiders.prod.job']` (see the sketch after these comments). – eLRuLL Nov 11 '15 at 12:21
  • @eLRuLL I just tried again; it doesn't work. I also tried putting all the spider source files directly in `timediff_crawler/spiders`, which didn't work either. – Michael Nov 11 '15 at 13:47
  • when you run `scrapy list`, is it listing your spiders? – eLRuLL Nov 11 '15 at 15:02
  • I totally understand where you are coming from; `scrapyd` makes it really hard to deploy scrapy spiders unless you create the right kind of python egg. Facing the same kind of frustrations, I went ahead and started my own project that helps run scrapy spiders programmatically with a good UI. Check it [here](https://github.com/kirankoduru/arachne) – kiran.koduru Nov 11 '15 at 23:23
  • @eLRuLL Yes, `scrapy list` can list all the spiders, so for now I run the spiders via crontab. – Michael Nov 12 '15 at 00:57
  • @kiran.koduru, It's a good project, I'll try it, thank you! – Michael Nov 12 '15 at 01:06
  • Two things worth trying: first, start `scrapyd` in the base directory of your project, which is `/opt/crawler/timediff_crawler`. Second, did you kill/delete the old `scrapyd-deploy` output and restart `scrapyd` before trying a new deploy? – LearnAWK Nov 16 '15 at 04:11
  • I saw a `scrapyd-deploy` folder in the base folder of your scrapy project. I don't have such a folder. What is in it? – LearnAWK Nov 19 '15 at 23:18
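
For reference, eLRuLL's `SPIDER_MODULES` suggestion from the comments would look roughly like this in `settings.py` (a sketch; the package paths come from the directory tree in the question):

# settings.py -- list every package that actually contains spider modules;
# scrapy only scans the packages named here (paths from the tree above)
SPIDER_MODULES = [
    'timediff_crawler.spiders.prod.job',
    'timediff_crawler.spiders.prod.rent',
]
NEWSPIDER_MODULE = 'timediff_crawler.spiders.prod.job'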

4 Answers


Thanks to @LearnAWK's advice, I found that the problem is caused by the following configuration in settings.py:

LOG_STDOUT = True

Actually I don't know why this configuration affects the result of scrapyd.
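
For completeness, this is the logging block from the comments below with the offending line disabled (a sketch of the fix, not the full settings.py):

# settings.py -- logging configuration; LOG_STDOUT was the culprit
LOG_ENABLED = True
LOG_LEVEL = 'INFO'
# LOG_STDOUT = True   # with this enabled, scrapyd-deploy reported 0 spiders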


The spiders should be in the `spiders` folder itself, not in subfolders.

The following is the directory tree within the base folder of my scrapy project. Is yours similar? (The tree in your question was cut off.)

my_scrapy_project/   
├── CAPjobs    
│   └── spiders    
│       └── oldspiders    
├── build    
│   ├── bdist.macosx-10.9-x86_64    
│   └── lib    
│       └── CAPjobs    
│           └── spiders    
├── db    
├── dbs    
├── eggs    
│   └── CAPjobs    
├── items    
│   └── CAPjobs    
│       ├── spider1    
│       ├── spider2    
│       └── spider3    
├── logs   
│   └── CAPjobs    
│       ├── spider1    
│       ├── spider2    
│       └── spider3 
└── project.egg-info    
  • I tried to put all spiders into the `spiders` folder, but it didn't work. – Michael Nov 19 '15 at 11:02
  • After you moved all spiders into the `spiders` folder, did you delete all the folders and files generated by prior runs of `scrapyd-deploy`? Did you delete the project from `localhost:6800`? Did you restart `localhost:6800` before retrying `scrapyd-deploy`? – LearnAWK Nov 19 '15 at 23:08
  • If it's still not working, my suggestion is to start a fresh scrapy project in new folders, then copy the very basic elements of your current scrapy files into the new folders, one spider at a time, and see whether you have any luck. – LearnAWK Nov 19 '15 at 23:10
  • Thanks to **LearnAWK's** advice, I think I've found the problem. In `settings.py` I had added the following configuration: `LOG_ENABLED = True`, `LOG_LEVEL = 'INFO'`, `LOG_STDOUT = True`. `LOG_STDOUT = True` is the cause; when I comment it out, everything is OK. By the way, spiders can be in subfolders of the `spiders` folder. – Michael Nov 20 '15 at 09:56
  • Quite interesting regarding the subfolders of `spiders`: I have a subfolder holding some spider scripts not needed for the current project, and those files didn't show up during the `scrapyd-deploy` command. It seems my experience is a little different. – LearnAWK Nov 20 '15 at 20:32
  • Do the spiders in the subfolder show up when running `scrapy list`? – Michael Nov 21 '15 at 05:15

This is a late reply to this thread, but I believe I figured out why Scrapyd reports back 0 spiders when you submit a new project version.

In my case my spiders have quite a few dependencies: Redis, Elasticsearch, certifi (for connecting to ES over SSL), and so on. When you run Scrapyd on a different machine (a production system), you have to replicate your Python dependencies exactly; on a local machine with the same Python virtualenv active, you won't run into the issue.
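
A quick way to check whether the scrapyd host is missing any of them is to try the imports there directly; a minimal sketch (the module names are just the examples mentioned above, so substitute your spiders' real imports):

# run this on the scrapyd machine, inside the virtualenv scrapyd uses
for mod in ('redis', 'elasticsearch', 'certifi'):
    try:
        __import__(mod)
        print '%s: ok' % mod
    except ImportError as exc:
        print '%s: MISSING (%s)' % (mod, exc)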

I found that, using the standard scrapy tutorial project, I could successfully add a new project version to Scrapyd. From there I started stripping down and commenting out lines of code and imports in my spiders until I could add them successfully too. When uncommenting a single import was enough to make scrapyd-deploy report 0 spiders again, it dawned on me that it was a dependency issue.

Hopefully this will help someone else having the same issue.

It would be very nice if scrapyd reported back in the deploy response when dependencies failed to load.


The development environment is different from production: Scrapyd cannot find the other classes. Try moving the related classes or modules under the spider folder, then fix the references.

To identify a problematic reference, you can comment out the references one by one and retry `scrapyd-deploy`.

Myapp/
├── spider/
│   └── bbc_spider.py
└── Model.py

In the development environment this works, but after deploying via scrapyd it may not, because bbc_spider.py cannot reach Model.py. In that case, move Model.py under the spider folder. I solved the problem like that.
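
A minimal sketch of the import change implied here (`MyModel` is a hypothetical class name used for illustration):

# bbc_spider.py
# before: works in development, but Model.py may not end up in the
# deployed egg where this import can find it:
#   from Model import MyModel
# after moving Model.py into the spider folder:
from Myapp.spider.Model import MyModel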
