
Hello, I am trying to run a script multiple times, but I would like this to take place at the same time. From what I understood, I was to use subprocess and threading together; however, when I run it, it still looks like it is being executed sequentially. Can someone help me get it to run the same script over and over, but at the same time? Is it in fact working and just really slow?

Edit: I forgot the last piece of code; it is now at the bottom.

Here is what I have so far:

import os
import datetime
import threading
from subprocess import Popen

today = datetime.date.today()
os.makedirs("C:/newscript_image/" + str(today))

class myThread(threading.Thread):
    def run(self):
        for filename in os.listdir('./newscript/'):
            if '.htm' in filename:
                name = filename.strip('.htm')

                dbfolder = "C:/newscript/db/" + name
                os.makedirs(dbfolder)

                Popen("python.exe C:/execution.py" + ' ' + filename + ' ' + name + ' ' + str(today) + ' ' + dbfolder)
myThread().start()
Null User
  • How "at the same time" do they need to be? subprocess will happily execute processes asynchronously. – mgilson Apr 05 '13 at 18:05
  • I think you might not have all the code there. That creates a thread class, but doesn't actually run the thread. It also seems like you could call whatever function you need in execution.py directly instead of invoking it through a new python.exe interpreter. – Michael Greene Apr 05 '13 at 18:06
  • @MichaelGreene Thanks for pointing that out. I did actually have the myThread().start() at the bottom of my code; I just forgot to include it. So should my threading be done in the execution.py script? Right now that script is also just running another script with some parameters, like script1.py param1, then the next line will be script1.py param2, and so on. – The Spiteful Octopus Apr 05 '13 at 18:12

3 Answers


Personally, I'd use multiprocessing. I'd write a function that takes a filename and does whatever the main guts of execution does (probably by importing execution and running some function within it):

import multiprocessing
import os
import datetime
import execution

#assume we have a function:
#execution.run_main_with_args(filename, name, today_str, dbfolder)

today = datetime.datetime.today()

def my_execute(filename):
    if '.htm' in filename:
        # slice off the extension; strip('.htm') would remove individual
        # leading/trailing characters, not the suffix as a whole
        name = filename[:-len('.htm')]
        dbfolder = "C:/newscript/db/" + name
        os.makedirs(dbfolder)
        execution.run_main_with_args(filename, name, str(today), dbfolder)

p = multiprocessing.Pool()
p.map(my_execute, list_of_files_to_process)
mgilson
  • Is there a way I could do this without rewriting the function? Right now execution also calls another script, and that one also calls about 10 scripts for various other things. Do I have to rewrite each one, or should they all just be combined into one massive script? – The Spiteful Octopus Apr 05 '13 at 18:27
  • You can do it by calling your `Popen` code instead of `execution.run_main_with_args`. The core idea of this answer is the multiprocessing pool of workers and mapping the list of files onto it, so that instead of running one extra thread that does all the processing, you have a pool of workers doing it. mgilson's solution just avoids the overhead of Popen and of calling out to a new python.exe interpreter. – Michael Greene Apr 05 '13 at 21:36
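
To illustrate that comment, here is a minimal sketch (not part of either answer) that keeps the call out to execution.py as a child process but bounds the concurrency with a multiprocessing pool. The "python" executable name, the pool size of 4, and the paths are assumptions carried over from the question:

import os
import datetime
import subprocess
import multiprocessing

today = datetime.date.today()

def run_one(filename):
    # build the per-file db folder from the base name
    name = filename[:-len('.htm')]
    dbfolder = "C:/newscript/db/" + name
    os.makedirs(dbfolder, exist_ok=True)
    # run the existing script as a child process and wait for it to finish
    subprocess.call(["python", "C:/execution.py", filename, name, str(today), dbfolder])

if __name__ == "__main__":
    files = [f for f in os.listdir("./newscript/") if f.endswith(".htm")]
    # at most 4 copies of execution.py run at the same time
    with multiprocessing.Pool(4) as pool:
        pool.map(run_one, files)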

Ran some quick tests. Using the framework of your script:

#!/usr/bin/env python

import os
import threading
from subprocess import Popen

class myThread(threading.Thread):
    def run(self):
        for filename in os.listdir("./newscript/"):
            if '.htm' in filename:
                Popen("./busy.sh")

myThread().start()

I then populated the "newscript" folder with a bunch of ".htm" files against which to run the script.

Where "busy.sh" is basically:

#!/usr/bin/env bash
while :
do
    uptime >> $$
    sleep 1
done

The code you have does indeed fire off multiple processes running in the background. I did this with a newscript folder containing 200 files, and I saw 200 processes all running in the background.

You noted that you want them all to run in the background at the same time.

For the most part, parallel processes run in the background "roughly" in parallel, but because of the way most common operating systems are set up, "parallel" is more like "nearly parallel", or what is more commonly called asynchronous. If you look at the access times very closely, the various processes spawned in this manner will each take a turn, but they will never all do something at the exact same instant.

That is something to be aware of, especially since you are accessing files controlled by the OS and the underlying filesystem.

For what you are trying to do (process a bunch of inbound files), the way you are doing it is basically to spawn off a background process for each file that appears.

There are a couple of issues with the logic as presented:

  1. There is a high risk of a fork-bomb situation, because your spawning is unbounded and nothing tracks what has already been spawned.
  2. The way you are spawning, by calling out and executing another program, creates a full OS-level process for each file, which is more resource intensive.

Suggestion:

Instead of spawning off jobs, you would be better off taking the file-processing code you would be spawning and turning it into a Python function. Rewrite your code as a daemonized process that watches the folder and keeps track of how many workers are running, so that the number of background processes handling file conversion stays managed.

When processing a file, you would spin off a Python thread to handle it, which is a lighter-weight alternative to spawning off an OS-level process.
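
A minimal sketch of that idea, assuming the real per-file work lives in a hypothetical process_file() function and using a thread pool to bound the number of simultaneous conversions (the watch path, the pool size, and the function name are placeholders, not from the original post):

import os
import time
from concurrent.futures import ThreadPoolExecutor

WATCH_DIR = "./newscript/"   # folder to watch (placeholder path)
MAX_WORKERS = 4              # upper bound on simultaneous conversions

def process_file(filename):
    # hypothetical stand-in for the real per-file conversion logic
    print("processing", filename)

def watch_forever():
    seen = set()
    # the executor caps how many files are being converted at once
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        while True:
            for filename in os.listdir(WATCH_DIR):
                if filename.endswith(".htm") and filename not in seen:
                    seen.add(filename)
                    pool.submit(process_file, filename)
            time.sleep(1)   # poll the folder once a second

if __name__ == "__main__":
    watch_forever()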

Wing Tang Wong

A little elaboration on mgilson's answer:

Let's say we have a folder example1.
Inside example1 we have two Python scripts:
execution.py and main.py

The contents of execution.py look like this:

def run_main_with_args(filename, name, today, dbfolder):
    # show which arguments this worker received
    print('\nfilename:', filename)
    print('name:', name)
    print('today:', today)
    print('dbfolder:', dbfolder)

    # write a small output file into the per-file db folder
    outfile = dbfolder + '/' + name + '.txt'
    with open(outfile, 'w') as fout:
        print(name, file=fout)  # write the base name into the file

Also, the contents of main.py look like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Author      : Bhishan Poudel; Physics Graduate Student, Ohio University
# Date        : Aug 29, 2016
#

# Imports
import multiprocessing
import os
import datetime
import execution  # file: execution.py

# assume we have a function:
# execution.run_main_with_args(filename, name, today_str, dbfolder)

today = datetime.datetime.today()

def my_execute(filename):
    if '.txt' in filename:
        # slice off the extension; strip('.txt') removes characters, not the suffix
        name = filename[:-len('.txt')]
        dbfolder = "db/" + name
        if not os.path.exists(dbfolder):
            os.makedirs(dbfolder)
        execution.run_main_with_args(filename, name, str(today), dbfolder)

# the __main__ guard matters for multiprocessing, especially on Windows
if __name__ == '__main__':
    p = multiprocessing.Pool()
    p.map(my_execute, ['file1.txt', 'file2.txt'])

Then, if we run main.py, it will create the required files in the required directories in parallel.
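
For example, with the scripts above you should end up with db/file1/file1.txt and db/file2/file2.txt after running main.py, each containing its own base name, and the argument printouts from the two workers interleaved on the console.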

BhishanPoudel