
I'm running Python code on several files. Since the files are all very large and each call processes only one file, it takes a very long time until the last file is processed. Hence my question: is it possible to use several workers that process the files in parallel?

Is this a possible invocation?:

import annotation as annot # this is a .py-file
import multiprocessing

pool = multiprocessing.Pool(processes=4)
pool.map(annot, "")

The .py-file uses for-loops (etc.) to find all the files by itself. The problem is: if I look at the running processes (with 'top'), I only see one process working on the .py-file. So I suspect that I shouldn't use multiprocessing like this... should I? Thanks for any help! :)

MarkF6

4 Answers


Yes. Use multiprocessing.Pool.

import multiprocessing
pool = multiprocessing.Pool(processes=<pool size>)
result = pool.map(<your function>, <file list>) 
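
For instance, here is a minimal, self-contained sketch (process_file and the glob pattern are placeholders for your own code). Note that pool.map expects a function and an iterable of arguments, one per file, rather than a module and an empty string as in the question:

import glob
import multiprocessing

def process_file(path):
    # placeholder: put your per-file annotation logic here
    with open(path) as f:
        return len(f.read())

if __name__ == '__main__':
    files = glob.glob('largeFileDir/*')       # assumed location of the input files
    pool = multiprocessing.Pool(processes=4)  # four worker processes
    results = pool.map(process_file, files)   # each worker call handles one file
    pool.close()
    pool.join()
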
sdamashek
  • 636
  • 1
  • 4
  • 13

My answer is not a pure Python answer, though I think it's the best approach given your problem.

This will only work on Unix systems (OS X/Linux/etc.).

I do stuff like this all the time, and I am in love with GNU Parallel. See this also for an introduction by the GNU Parallel developer. You will likely have to install it, but it's worth it.

Here's a simple example. Say you have a python script called processFiles.py:

#!/usr/bin/python
#
# Script to print out the file name
#
import sys

fileName = sys.argv[1]  # the file path passed on the command line
print( fileName )       # adapt for python 2.7 if you need to

To make this file executable:

chmod +x processFiles.py

And say all your large files are in largeFileDir. Then to process all of them in parallel, four at a time (-P4), run this at the command line:

$ parallel -P4 ./processFiles.py ::: largeFileDir/*

This will output something like

largeFileDir/file1
largeFileDir/file3
largeFileDir/file7
largeFileDir/file2
...

They may not be in order, because each job runs independently and in parallel. To adapt this to your case, replace the toy script above with your actual file-processing script.

This is preferable to threading in your case because each file-processing job gets its own instance of the Python interpreter. Since each file is processed independently (or so it sounds), threading is overkill. In my experience this is the most efficient way to parallelize a workload like the one you describe.

There is something called the Global Interpreter Lock that I don't understand very well, but it has caused me headaches when I've tried to get parallelism out of Python's built-in threading. Which is why I say: if you don't need to thread, don't. Instead do as I've recommended and start up independent Python processes.

Matthew Turner

There are many options.

  • multiple threads
  • multiple processes
  • "green threads", I personally like Eventlet

Then there are more "enterprise" solutions, some of which can even run workers on multiple servers, e.g. Celery; for more, search for "distributed task queue python".
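
For instance, a minimal Celery sketch (the Redis broker URL, the module name tasks.py and the task body are assumptions; substitute your own processing):

# tasks.py -- assumes a Redis broker running locally
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def process_file(path):
    # placeholder: replace with your real per-file processing
    with open(path) as f:
        return len(f.read())

You would then start one or more workers with celery -A tasks worker and queue jobs with process_file.delay('/path/to/file'); the workers can live on several machines, which is where this approach pays off.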

In all cases, your scenario will become more complex, and sometimes you will not gain much, e.g. if your processing is limited by I/O (reading the data) rather than by computation.

Jan Vlcinsky
  • 42,725
  • 12
  • 101
  • 98

Yes, this is possible. You should investigate the threading module and the multiprocessing module. Both will allow you to execute Python code concurrently. One note about the threading module, though: because of the way Python is implemented (Google "python GIL" if you're interested in the details), only one thread will execute at a time, even if you have multiple CPU cores. This is different from the threading implementation in other languages, where each thread can run at the same time, each using a different core. Because of this limitation, in cases where you want to do CPU-intensive operations concurrently, you'll get better performance with the multiprocessing module.
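
For example, here is a minimal sketch of the multiprocessing-based route using concurrent.futures (process_file and the glob pattern are placeholders for your own CPU-intensive work); swapping ProcessPoolExecutor for ThreadPoolExecutor would keep everything in one process and therefore under one GIL:

import glob
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    # placeholder: replace with your CPU-intensive per-file work
    with open(path) as f:
        return sum(len(line) for line in f)

if __name__ == '__main__':
    files = glob.glob('largeFileDir/*')  # assumed location of the input files
    with ProcessPoolExecutor(max_workers=4) as executor:
        for path, result in zip(files, executor.map(process_file, files)):
            print(path, result)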

dano
  • 91,354
  • 19
  • 222
  • 219