3

I'm working on a Python script that will continuously scrape data, which will take quite a long time. Is there a safe way to stop a long-running Python script? The loop will run for more than 10 minutes, and I need a way to stop it on demand once it is already running.

If I execute it from a cron job, then I'm assuming it'll just run until it's finished, so how do I stop it?

Also, if I run it from a browser by just calling the file, I'm assuming that stopping the page from loading would halt it. Is that correct?


Here's the scenario:
I have one Python script that gathers info from pages and puts it into a queue. I then want another Python script that runs in an infinite loop and just checks for new items in the queue. Let's say I want the infinite loop to begin at 8am and end at 8pm. How do I accomplish this?

Rawr
  • 1
    a) What operating system? b) How are you running it? (you list cron as one example- is that the only example?) c) What browser, and what web server (e.g. apache)? For example, if you are running it from the command line on a Mac, you'd do COMMAND-., or Control-C in Linux. – David Robinson Aug 10 '12 at 09:08
  • How is the information shared between the two programs (the queue)? – David Robinson Aug 10 '12 at 09:09
  • a) windows b) it'll get start up by either a php call or a python call c) I'd be using firefox or chrome and I'm running it on my own windows pc with a wamp server. I just added more details in the "here's the scenario". The information will be shared with either a txt file or a mysql database. I'm thinking of putting a halt/sleep/wait at the end of the loop so it waits at least a couple seconds before checking the queue again. – Rawr Aug 10 '12 at 09:10
  • @Alp so something like `os.system("killall")`. Is there a way to check to make sure it worked. Like some command I can run to see what scripts are running, then use `killall` and check again to make sure it ended? – Rawr Aug 10 '12 at 09:12
  • 2
    very bad program design – Dmitry Zagorulkin Aug 10 '12 at 09:13
  • @Rawr: killall is *way* overkill for this situation. It will stop all python programs running on the machine, and you might have others that you're trying to run at the time. – David Robinson Aug 10 '12 at 09:14
  • @Zagorulkin Dmitry it's only for the initial scrape. It's not gonna run indefinitely. I understand having something loop forever is a waste of resources. – Rawr Aug 10 '12 at 09:14
  • @Rawr: I'm not sure why just checking the time (as in my answer below) doesn't work for you. – David Robinson Aug 10 '12 at 09:16
  • @David Robinson I was looking for some command I could call in another script. something like stop.py that could just call out a simple command to stop the python script queue.py – Rawr Aug 10 '12 at 09:17
  • 2
    Just press the power off button patiently. – Vidul Aug 10 '12 at 09:18
  • Why in another script? Why not just directly? – David Robinson Aug 10 '12 at 09:19
  • @DavidRobinson Cause if it's running already, then how can I send an input to the same script during its execution? – Rawr Aug 10 '12 at 09:20
  • Also, do not *ever* do `os.system("killall")`: that will kill all processes on your computer, like your other applications. @Alp's suggestion was `os.system("killall python")`, which kills all python processes. – David Robinson Aug 10 '12 at 09:20
  • @Rawr: if you're running it from the terminal, you can do so using a keyboard interrupt. If it's running in the background, you could run `killall python` *directly* in the command line. Again: how are you starting the python process? – David Robinson Aug 10 '12 at 09:21
  • @DavidRobinson THANKS, that could have been unpleasant :/ – Rawr Aug 10 '12 at 09:21

3 Answers

5

Let me offer an alternative. It looks like you want real-time updates for some kind of information. You could use a pub/sub (publish/subscribe) interface. Since you are using Python, there are plenty of options.

One of them is using Redis pub/sub functionality: http://redis.io/topics/pubsub/ - and here is the corresponding python module: redis-py

-Update-

Example

Here is an example from dirkk0 (question / answer):

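import cmd
import sys
import threading

import redis  # third-party package: pip install redis


def monitor():
    # YOURHOST, YOURPORT and YOURPASSWORD are placeholders for your own settings.
    r = redis.Redis(host=YOURHOST, port=YOURPORT, password=YOURPASSWORD, db=0)

    channel = sys.argv[1]
    p = r.pubsub()

    p.subscribe(channel)

    print('monitoring channel', channel)
    # listen() blocks and yields each message published to the channel.
    for m in p.listen():
        print(m['data'])


class my_cmd(cmd.Cmd):
    """Simple command processor example."""

    def do_start(self, line):
        # Typing "start" at the prompt launches the listener thread.
        my_thread.start()

    def do_EOF(self, line):
        # End of input exits the command loop.
        return True

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print("missing argument! please provide the channel name.")
    else:
        # Daemon thread: it does not keep the process alive once cmdloop() returns.
        my_thread = threading.Thread(target=monitor, daemon=True)

        my_cmd().cmdloop()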

-Update 2-

In addition, look at this tutorial:

http://blog.abourget.net/2011/3/31/new-and-hot-part-6-redis-publish-and-subscribe/
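
As for stopping the listener once it is running: one possible approach (a minimal sketch, not part of the example above) is to agree on a sentinel message and leave the loop when it arrives. The names `STOP`, `consume` and `handle` are hypothetical. The function accepts any iterable of dicts shaped like the ones redis-py's `pubsub.listen()` yields, so it can be tried without a Redis server.

```python
STOP = "STOP"  # hypothetical sentinel; any string both scripts agree on works


def consume(messages, handle):
    """Process pub/sub messages until the STOP sentinel arrives.

    `messages` is any iterable of dicts with "type" and "data" keys,
    such as the values yielded by redis-py's pubsub.listen().
    Returns the number of items passed to `handle`.
    """
    handled = 0
    for message in messages:
        if message.get("type") != "message":
            continue  # skip subscribe/unsubscribe confirmations
        data = message["data"]
        if isinstance(data, bytes):
            data = data.decode("utf-8")  # redis-py returns bytes by default
        if data == STOP:
            break  # clean exit requested by another script
        handle(data)
        handled += 1
    return handled
```

With a real server you would pass `p.listen()` as `messages`, and a separate stop script could end the loop by calling `r.publish(channel, "STOP")` on the same channel.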

Alp
  • oh this looks very nice, could you explain a bit more how it works? – Rawr Aug 10 '12 at 09:18
  • Just so I'm clear. I use redis which listens. I use file1.py to send information to redis, which in turn executes file2.py which runs whenever I need it to. – Rawr Aug 10 '12 at 09:32
0

I guess one way to work around the issue is to have a script that performs a single pass of the loop. It would:

  1. check that no other instance of the script is running
  2. look into the queue and process everything found there

You can then run this script from cron every minute between 8 a.m. and 8 p.m. The only downside is that new items may take some time to get processed.
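
The two steps above could be sketched roughly as follows. This is only an illustration under assumptions: the lock-file approach, the names `acquire_lock`, `run_once` and `within_window`, and the `fetch_items`/`handle` callables are all hypothetical, and how items are actually read from the text file or MySQL table is left to the caller.

```python
import os
from datetime import datetime, time as dtime


def within_window(now, start=dtime(8, 0), end=dtime(20, 0)):
    """Return True if `now` falls between 8 a.m. (inclusive) and 8 p.m. (exclusive)."""
    return start <= now.time() < end


def acquire_lock(path):
    """Atomically create a lock file; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for debugging
    return True


def run_once(lock_path, fetch_items, handle):
    """Process everything currently queued, unless another instance holds the lock.

    Returns the number of items handled, or None if the run was skipped.
    """
    if not acquire_lock(lock_path):
        return None
    try:
        count = 0
        for item in fetch_items():
            handle(item)
            count += 1
        return count
    finally:
        os.remove(lock_path)  # release the lock even if handling fails
```

A cron entry such as `* 8-19 * * *` runs every minute from 08:00 to 19:59. If the scheduler cannot restrict the hours, the script could call `within_window(datetime.now())` first and exit early. Note that a lock file left behind by a hard crash would have to be removed by hand.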

che
0

I don't think stopping the page from loading in the browser necessarily stops the Python script. I suggest starting your script under the control of a parent process using fork:

Example:

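import os, signal, time

TIMEOUT_SECONDS = 2  # demo value; use 600 for a 10-minute limit

def child():
    print('A new child', os.getpid())
    time.sleep(5)  # stand-in for the long-running work
    os._exit(0)

def parent():
    # Note: os.fork() exists only on Unix-like systems, not on Windows.
    while True:
        newpid = os.fork()
        if newpid == 0:
            child()
        else:
            print("parent: %d, child: %d" % (os.getpid(), newpid))
            print("start counting time for child process...!")
            start = time.monotonic()
            while True:
                time.sleep(0.1)  # avoid a busy loop
                # Stop waiting if the child has already finished on its own.
                finished, _ = os.waitpid(newpid, os.WNOHANG)
                if finished:
                    break
                # Kill the child if it exceeds the time limit.
                if time.monotonic() - start >= TIMEOUT_SECONDS:
                    os.kill(newpid, signal.SIGKILL)
                    os.waitpid(newpid, 0)  # reap the killed child
                    break

        if input() == 'q': break

parent()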
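
Since `os.fork()` is not available on Windows, the same supervising-parent idea might be expressed with the standard `subprocess` module instead. This is a hedged sketch rather than a drop-in replacement: `run_with_timeout` is a hypothetical helper name, and `queue.py` in the usage note stands for whatever script you want to supervise.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_seconds):
    """Run a command and kill it if it is still running after the timeout.

    Returns the exit code, or None if the process had to be killed.
    """
    proc = subprocess.Popen(args)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # forcefully stop the child process
        proc.wait()   # collect its exit status so no zombie is left behind
        return None
```

For example, `run_with_timeout([sys.executable, "queue.py"], 600)` would give the script at most ten minutes to finish.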
Emine