
I am trying to read a large file into a pandas data structure in a separate process using the multiprocessing module. In the code attached below, the read_file function completes successfully, because "2" gets printed. But then the Python command window gets stuck at p1.join(), and "3" is never printed.

I have read that a multiprocessing Process has a size limit. If that is why my file isn't getting through, can anyone suggest an alternative way to read a large pandas structure in a separate process?

Ultimately, I hope to read two large pandas structures simultaneously and concatenate them in the main function to halve the script's run time.

import pandas as pd
from multiprocessing import Process, Queue

def read_file(numbers,retrns):
    Product_Master_XLSX = pd.read_excel(r'G:\PRODUCT MASTER.xlsx',sheetname='Table')
    retrns.put(Product_Master_XLSX)
    print "2"

if __name__ == "__main__":
    arr = [1]
    queue1 = Queue()
    p1 = Process(target=read_file, args=(arr,queue1))
    p1.start()
    print "1"
    p1.join()
    print "3"
    print queue1.get()
  • Any reason why you need to do this in a new process? You're not doing anything useful in the main thread. – cs95 Sep 11 '17 at 20:10
  • @cᴏʟᴅsᴘᴇᴇᴅ Once I am able to read this file, I intend to have many files read simultaneously using processes and concatenate them in the main function. – python_enthusiast Sep 11 '17 at 20:13
  • Multiprocessing can help when a problem is *CPU-bound*. It does not help when a problem is *I/O-bound*. Reading from a file is I/O-intensive, not CPU-intensive. (One process will simply wait until the other process is done reading from disk.) So I don't think using multiprocessing is going to speed up the process of reading from files. – unutbu Sep 11 '17 at 20:34
  • Along with the sequential reading from file problem mentioned above, items in the Queue are pickled by the sending process and unpickled by the receiving process. Those are extra steps that multiprocessing code needs to do that sequential code does not. Multiprocessing will improve speed only when the amount of parallel computation dwarfs the extra overhead of things like starting the processes and communicating through Queues. Since your code does not do much parallel computation, **even if you avoid the deadlock, your parallel code will run slower than basic sequential code**. – unutbu Sep 11 '17 at 20:47
  • @unutbu Thank you for your comments. I have stopped trying to read the file via processes; now I am only downloading multiple files simultaneously via processes. Which brings me to my next concern: how to have a separate download directory for each downloaded file. I will create a new question for it. Please respond if you can. – python_enthusiast Sep 12 '17 at 17:41
  • To avoid deadlock, "you need to .get() all the items off the queue before you attempt to .join() the processes." See [Tim Peters' answer, here](https://stackoverflow.com/a/45948595/190597). – unutbu Sep 13 '17 at 00:29
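
A minimal sketch of the change the last comment points to, assuming a second, hypothetical workbook (PRODUCT MASTER 2.xlsx) and the same sheetname='Table' call as the original code: drain each Queue with .get() before calling .join(), then concatenate in the parent process.

import pandas as pd
from multiprocessing import Process, Queue

def read_file(path, retrns):
    # read one workbook and push the resulting DataFrame onto the queue
    df = pd.read_excel(path, sheetname='Table')
    retrns.put(df)

if __name__ == "__main__":
    queue1 = Queue()
    queue2 = Queue()
    # the second path is hypothetical; substitute your own files
    p1 = Process(target=read_file, args=(r'G:\PRODUCT MASTER.xlsx', queue1))
    p2 = Process(target=read_file, args=(r'G:\PRODUCT MASTER 2.xlsx', queue2))
    p1.start()
    p2.start()
    # get() before join(): a process that has put large items on a Queue will
    # not exit until that data has been consumed, so joining first deadlocks
    df1 = queue1.get()
    df2 = queue2.get()
    p1.join()
    p2.join()
    combined = pd.concat([df1, df2])
    print combined.shape

As unutbu notes above, this still may not be faster than reading the files sequentially, since the DataFrames are pickled through the Queue and the reads are I/O-bound rather than CPU-bound.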

0 Answers