25

I want to execute my Scrapy crawler from a cron job.

I created a bash file, getdata.sh, in the folder where my Scrapy project and its spiders are located:

#!/bin/bash
cd /myfolder/crawlers/
scrapy crawl my_spider_name

My crontab looks like this; I want to execute the script every 5 minutes:

 */5 * * * * sh /myfolder/crawlers/getdata.sh 

but it doesn't work. What's wrong? Where is my error?

When I execute my bash file from the terminal with sh /myfolder/crawlers/getdata.sh, it works fine.

Beka Tomashvili
  • is the `sh` "prefix" in `*/5 * * * * sh /myfolder/crawlers/getdata.sh` necessary to execute shell scripts from `crontab`? – oldboy Jul 02 '18 at 02:38

8 Answers

34

I solved this problem by including PATH in the bash file:

#!/bin/bash

cd /myfolder/crawlers/
PATH=$PATH:/usr/local/bin
export PATH
scrapy crawl my_spider_name
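
If scrapy isn't in /usr/local/bin on your machine, you can find the directory to append (as the comments below also suggest); the path in the comment is just an example:

which scrapy
# prints e.g. /usr/local/bin/scrapy; append that directory to PATH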
Beka Tomashvili
  • +1 Had the same problem and simply couldn't figure it out. You should mark your question as the accepted answer. :) – Xethron Sep 21 '13 at 09:47
  • I guess PATH should not always be set to /usr/local/bin; it depends on what environment and server you are on, right? So what should PATH be set to? The folder of.... ? – Marcus Lind Apr 13 '15 at 11:37
  • I'm not a Linux guru, can someone ELI5 why executing the bash script from bash works, but executing it in cron doesn't? – joe_coolish Nov 16 '18 at 20:42
  • @MarcusLind gotcha. PATH should be set to where scrapy is located. You can find this folder with the command: which scrapy – Jhonatas Kleinkauff Sep 24 '19 at 23:55
  • But don't you have to have it enter the virtual environment first? I'm able to execute my spider with a compound command (enters virtual environment, then starts the scrapy script) but your example is not showing entering the virtual environment. How are you able to make it run without first doing that step? – rom Sep 25 '20 at 06:59
  • Can you share the complete step by step guide of how did you applying cronjob? – Adeena Lathiya Jul 15 '22 at 22:00
13

Adding the following lines in crontab -e runs my scrapy crawl at 5 AM every day. This is a slightly modified version of croc's answer.

PATH=/usr/bin
0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name

Without setting $PATH, cron would give me the error "command not found: scrapy". I guess this is because /usr/bin is where executables for programs are stored on Ubuntu.

Note that the complete path for my scrapy project is /home/user/project_folder/project_name. I ran the env command in cron and noticed that the working directory there is /home/user; hence I skipped /home/user in my crontab above.
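
If you want to inspect cron's environment yourself, a throwaway entry like the following dumps it to a file (the /tmp path is just an example; remove the entry afterwards):

* * * * * env > /tmp/cron_env.txt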

The cron log can be helpful while debugging:

grep CRON /var/log/syslog
NFern
6

For anyone who used pip3 (or similar) to install scrapy, here is a simple inline solution:

*/10 * * * * cd ~/project/path && ~/.local/bin/scrapy crawl something >> ~/crawl.log 2>&1

Replace:

*/10 * * * * with your cron pattern

~/project/path with the path to your scrapy project (where your scrapy.cfg is)

something with the spider name (use scrapy list in your project to find out)

~/crawl.log with your log file location (in case you want logging)
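
For example, with a hypothetical project at ~/projects/news containing a spider named articles, the finished entry would be:

*/10 * * * * cd ~/projects/news && ~/.local/bin/scrapy crawl articles >> ~/crawl.log 2>&1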

nottmey
  • where does the path `~/.local/bin/scrapy` come from or what is the significance of it? – oldboy Jul 02 '18 at 00:57
  • That's the place where the `scrapy` command was located for me when installing it with `pip3`. Since the plain `scrapy` command was not accessible in my cron context, I solved it by accessing it directly. – nottmey Jul 09 '18 at 23:24
  • would that prevent the need to alter `PATH`? – oldboy Jul 09 '18 at 23:29
  • yes, `PATH` is irrelevant when accessing the command directly – nottmey Jul 10 '18 at 13:38
3

Another option is to forget using a shell script and chain the two commands together directly in the cronjob. Just make sure the PATH variable is set before the first scrapy cronjob in the crontab list. Run:

    crontab -e 

to edit and have a look. I have several scrapy crawlers which run at various times: some every 5 minutes, others twice a day.

    PATH=/usr/local/bin
    */5 * * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_1
    0 1,13 * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_2

All jobs located after the PATH line will find scrapy. Here the first one runs every 5 minutes and the second twice a day, at 1 AM and 1 PM. Note that a crontab edited with crontab -e has no user field; that column only exists in /etc/crontab. I found this easier to manage. If you have other binaries to run, you may need to add their locations to the path.

croc
1

Check where scrapy is installed using the which scrapy command. In my case, scrapy is installed in /usr/local/bin.

Open the crontab for editing using crontab -e and add the following. Note that crontab lines do not expand $PATH and do not understand export, so spell the path out:

PATH=/usr/local/bin:/usr/bin:/bin
*/5 * * * * cd /myfolder/path && scrapy crawl spider_name

It should work. Scrapy will run every 5 minutes.

Oni
0

Does your shell script have execute permission?

E.g. can you do

  /myfolder/crawlers/getdata.sh

without the sh?

If you can, then you can drop the sh from the line in cron.
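
If it isn't executable yet, a quick check and fix (standard shell commands, not part of the original answer):

ls -l /myfolder/crawlers/getdata.sh   # look for x bits in the permissions
chmod +x /myfolder/crawlers/getdata.sh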

KeepCalmAndCarryOn
0

In my case scrapy is in ~/.local/bin/scrapy. Give the proper path to the scrapy executable and the spider name, and it works perfectly:

0 0 * * * cd /home/user/scraper/Folder_of_scriper/ && /home/user/.local/bin/scrapy crawl "name" >> /home/user/scrapy.log 2>&1

The >> /home/user/scrapy.log 2>&1 part saves the output and errors to scrapy.log, so you can check whether the program worked.
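
To watch that log while the job runs (a standard command, shown here as a suggestion):

tail -f /home/user/scrapy.log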

Thank you.

0

I run my Scrapy spider on a Raspberry Pi running Debian 11 (bullseye). The following settings/workflow worked for me:

First cd to your project directory. Install scrapy in a venv virtual environment (without sudo, so the install goes into the venv rather than system-wide):

python3 -m venv ./venv
source ./venv/bin/activate
pip3 install scrapy

Create your spiders.

Create the shell file (getdata.sh), and use full directory paths (including /home/username/ etc.):

#!/bin/bash
#activate virtual environment
source "/full/path/to/project/venv/bin/activate"

#move to the project directory 
cd /full/path/to/project/

#start spider
scrapy crawl my_spider_name

Schedule the spider in crontab using the following line in crontab -e:

   */5 * * * * /full/path/to/shfile/getdata.sh
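
Since cron invokes the script directly here (no sh prefix), it must be executable, and as in the other answers you can redirect output to a log for debugging. The log path below is just an example:

chmod +x /full/path/to/shfile/getdata.sh
*/5 * * * * /full/path/to/shfile/getdata.sh >> /home/username/cron.log 2>&1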