
I have multiple data files that I process using the Python pandas library. Each file is processed one by one, and only one logical processor is in use when I look at Task Manager (it sits at ~95%, while the rest stay within 5%).

Is there a way to process the data files simultaneously? If so, is there a way to utilize the other logical processors to do that?

(Edits are welcome)

Jek Denys

2 Answers


If your file names are in a list, you could use this code:

from multiprocessing import Process

def YourCode(filename, otherdata):
    # Do your stuff
    pass

if __name__ == '__main__':
    # Post-process files in parallel
    ListOfFilenames = ['file1', 'file2', ..., 'file1000']
    otherdata = None  # whatever extra data YourCode needs
    Processors = 20  # number of processors you want to use
    # Divide the list of files into batches of 'Processors' files each
    # (range, not xrange, since the OP is on Python 3)
    Parts = [ListOfFilenames[i:i + Processors] for i in range(0, len(ListOfFilenames), Processors)]

    for part in Parts:
        # Start one process per file in this batch...
        ListOfProcesses = []
        for f in part:
            p = Process(target=YourCode, args=(f, otherdata))
            p.start()
            ListOfProcesses.append(p)
        # ...then wait for the whole batch to finish before starting the next
        for p in ListOfProcesses:
            p.join()
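
Alternatively, multiprocessing.Pool can handle this batching for you: it keeps a fixed number of worker processes busy and hands out the remaining files as workers become free, instead of waiting for a whole batch to finish. A minimal sketch (Python 3; YourCode and the file names are placeholders as above):

from multiprocessing import Pool

def YourCode(filename, otherdata):
    # placeholder: read the file with pandas and process it
    pass

if __name__ == '__main__':
    ListOfFilenames = ['file1', 'file2', 'file3']
    otherdata = None  # whatever extra data YourCode needs
    # 20 worker processes; files are dispatched as workers become free
    with Pool(processes=20) as pool:
        pool.starmap(YourCode, [(f, otherdata) for f in ListOfFilenames])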
Diego
  • Take a look at `concurrent.futures.ProcessPoolExecutor` - the same idea, but carefully thought out, with corner cases handled, and such - https://docs.python.org/3/library/concurrent.futures.html – jsbueno Jan 16 '17 at 17:05
  • Python 2.7 is seven years old now - and it was already somewhat old when it was released, as Python 3 was already around. The OP does not mention that he is using Python 2. (Of course, an answer suggesting concurrent.futures would have to mention that it is Python 3 only.) – jsbueno Jan 16 '17 at 17:15
  • You are right. I should have specified. I'm using Python 3.5. – Jek Denys Jan 16 '17 at 18:01
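
For reference, a minimal sketch of the concurrent.futures.ProcessPoolExecutor approach jsbueno suggests (Python 3 only; process_file and the file names are hypothetical placeholders):

from concurrent.futures import ProcessPoolExecutor

def process_file(filename):
    # placeholder: read the file with pandas, process it, return a result
    return filename

if __name__ == '__main__':
    filenames = ['file1', 'file2', 'file3']
    # max_workers defaults to the machine's processor count
    with ProcessPoolExecutor(max_workers=4) as executor:
        for result in executor.map(process_file, filenames):
            print(result)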

You can process the different files in different threads or in different processes.

The good thing about Python is that its standard library provides tools for you to do this:

from multiprocessing import Process

def process_panda(filename):
    # This function runs in a separate process:
    # read the file with pandas, process it, and write the results
    pass

if __name__ == '__main__':
    p1 = Process(target=process_panda, args=('file1',))
    p1.start()  # start process 1
    p2 = Process(target=process_panda, args=('file2',))
    p2.start()  # start process 2
    p2.join()   # wait until process 2 has finished
    p1.join()   # wait until process 1 has finished

The program will start two child processes, which can be used to process your files. Of course you can do something similar with threads; a sketch follows.
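
For completeness, a minimal threading sketch of the same idea (note the caveat in the comments below: CPython's GIL keeps threads from running CPU-bound Python code on multiple cores at once, so processes are usually the better fit for this workload):

from threading import Thread

def process_panda(filename):
    # placeholder: read the file with pandas and process it
    pass

threads = [Thread(target=process_panda, args=(f,)) for f in ('file1', 'file2')]
for t in threads:
    t.start()  # start all threads
for t in threads:
    t.join()   # wait until every thread has finished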

You can find the documentation here: https://docs.python.org/2/library/multiprocessing.html

and here:

https://pymotw.com/2/threading/

KimKulling
  • Quick note: it looks like Python threads won't use multiple cores, according to: http://stackoverflow.com/questions/7542957/is-python-capable-of-running-on-multiple-cores. The `multiprocessing` library will use them, though. – phss Jan 16 '17 at 16:47
  • @KimKulling, Thank you for the code and the additional links :) – Jek Denys Jan 17 '17 at 14:44