
So I have two Python scripts. The first is a parser that scans through thousands of files, and the second is a scheduler that forks the scan on hundreds of separate directories. My problem is this:

I have limited disk space, and each scan uses around 1 GB of local sqlite3 storage. I need to limit the number of processes so that while the maximum number of processes is running, I won't hit the disk I/O errors I've been getting.

I've tried using the following code to fork the scans and cap the processes at 8, but when I look in my temp directory (where the temp local files are stored), there are substantially more than 8 files, showing me that I'm not limiting the processes properly (I use os.remove to get rid of the temp files after each scan is done).

This is my execute-scan method, which just forks off a process with a well-formatted command:

import subprocess   # used by subprocess.call below

def execute_scan(cmd):
    try:
        log("Executing " + str(cmd))        # log() is a logging helper defined elsewhere
        subprocess.call(cmd, shell=False)   # run the scan as a child process and wait for it
    except Exception as e:
        log(e)
        log(cmd)

This is in my main method, where getCommand(obj) converts data in an object to a command array.

tasks = [getCommand(obj) for obj in scanQueue if getCommand(obj) is not None]   
multiprocessing.Pool(NUM_PROCS).map(execute_scan, tasks)
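
For reference, here is a self-contained sketch of the setup described above, with the os.remove cleanup shown in place inside the worker. `temp_path_for` is a hypothetical helper standing in for however each scan names its temp file, and `getCommand`/`scanQueue` are as in the question:

import os
import subprocess
import multiprocessing

NUM_PROCS = 8   # upper bound on simultaneous scans

def execute_scan(cmd):
    try:
        subprocess.call(cmd, shell=False)
    finally:
        # hypothetical helper: however each scan names its ~1 GB temp file
        path = temp_path_for(cmd)
        if os.path.exists(path):
            os.remove(path)

if __name__ == '__main__':
    # call getCommand() once per object instead of twice
    tasks = [cmd for cmd in (getCommand(obj) for obj in scanQueue) if cmd is not None]
    pool = multiprocessing.Pool(NUM_PROCS)
    try:
        pool.map(execute_scan, tasks)
    finally:
        pool.close()
        pool.join()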

I could use any advice I can get because I'm dealing with a lot of data and my disk is not that big.

Thanks a lot!

onetwopunch
  • I'll repeat this once more: [parallelizing I/O bound tasks leads to worse runtimes than running the tasks in a single thread (or process)](http://stackoverflow.com/a/20421535/1595865). Using multiple threads or processes is only useful if you're dealing with a CPU bound task (and even then, not every time) – loopbackbee Dec 09 '13 at 17:44
  • You haven't shown any code that removes temp files, but that's probably where the problem lies. Looking at temp files to indirectly infer how many processes are running is bizarre ;-) Use an OS tool to count the number of processes directly. `Pool(NUM_PROCS)` creates *exactly* `NUM_PROCS` processes - no more and no less. – Tim Peters Dec 09 '13 at 17:53
  • @TimPeters As I mention in the description, I remove the temp files with os.remove(path) and that part works fine. – onetwopunch Dec 09 '13 at 21:08
  • @goncalopp Do you think I could just avoid the multiprocessing altogether due to the bottleneck at the IO portion? Would I gain any speed doing it that way due to less context switching overhead? – onetwopunch Dec 09 '13 at 21:11
  • @ecesurfer, nevertheless, you did not show the code in context. Can only repeat that `Pool(NUM_PROCS)` creates exactly `NUM_PROCS` processes, and counting temp files is **not** counting processes. In other words, I don't believe you ;-) If your OS shows more processes than that, then I might. – Tim Peters Dec 09 '13 at 21:22
  • @ecesurfer, timing depends on so many details you shouldn't believe any "head argument": try various ways and time them! That's dead easy. It's certainly *possible* that multiprocessing is slower here; it's also possible that it's faster. The only thing that can tell you which, on your system, with your data, and your code, is your clock. If it is slower, it won't be because of "context switching", it would be because multiple processes are grinding your disk to dust by forcing the read heads to leap all over the place all the time. Disks are partly mechanical, and like reading contiguously. – Tim Peters Dec 09 '13 at 21:47

2 Answers


gevent.pool.Pool may be a good solution for you, because gevent uses greenlets for concurrency and only one greenlet runs at a time.

In your situation, first set the pool size to a suitable number, which means at most that many greenlets can be doing I/O at once. Then turn the function that does the scan task into a greenlet and add it to the pool, where the hub greenlet will schedule it.
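
A minimal sketch of that pattern, assuming gevent 1.0+ is installed (`scan` is a placeholder for the function that runs one scan, and `tasks` is the command list built in the question):

from gevent.pool import Pool
from gevent import subprocess   # cooperative replacement for the stdlib subprocess

pool = Pool(8)                  # at most 8 greenlets run scans at any one time

def scan(cmd):
    subprocess.call(cmd, shell=False)

for cmd in tasks:               # tasks built the same way as in the question
    pool.spawn(scan, cmd)       # each scan becomes a greenlet scheduled by the hub
pool.join()                     # wait for all greenlets to finish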

Here is a brief tutorial on the usage of gevent.pool.Pool.

flyer

Though I probably could have used multiprocessing on this application, it turns out that because the I/O to the sqlite3 database was the bottleneck, multiprocessing was actually slowing it down, just as goncalopp predicted.
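
For reference, the single-process version is just a plain loop over the same task list (a sketch reusing `execute_scan` and `tasks` from the question):

for cmd in tasks:
    execute_scan(cmd)   # one scan at a time, so the disk only ever serves a single scan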

onetwopunch