I am trying to read N .csv files, whose paths are stored in a list, in parallel (several at the same time).

Right now I do the following:

import multiprocessing

  1. Empty list
  2. Append list with listdir of .csv's
  3. def A() -- even files (list[::2])
  4. def B() -- odd files (list[1::2])
  5. Process 1 def A()
  6. Process 2 def B()

    import glob
    from multiprocessing import Process

    file_list = []

    def read_even():
        for f in file_list[::2]:   # even-indexed files
            pass                   # read each .csv here

    def read_odd():
        for f in file_list[1::2]:  # odd-indexed files
            pass                   # read each .csv here

    def read_all_lead_files(folder):
        for f in glob.glob(folder + "*.csv"):
            file_list.append(f)

        # children inherit the populated file_list when started (fork on Unix)
        p1 = Process(target=read_even)
        p1.start()
        p2 = Process(target=read_odd)
        p2.start()
    

Is there a faster way to split the list up and hand the pieces off to the Process functions?

Christopher W
  • You're saying you do different processing on "even" and "odd" files (whatever *that* means)? – tdelaney May 21 '14 at 21:37
  • "Faster"? In what sense? Are the two functions actually different in some way? Without at least a little knowledge of what the two functions do and what you're trying to improve, I don't see how we can help you. Post some minimal code. – Henry Keiter May 21 '14 at 21:38
  • The actual splitting of the list into even and odd is very fast. But A() / B() on every other file seems very arbitrary. Why are you doing that? – tdelaney May 21 '14 at 21:44
  • POSTing to a server. Governance only allows a certain number of connections per POST. Multiple POSTs get around this issue. – Christopher W May 21 '14 at 21:47
  • I asked because I guessed that you wanted a Pool and didn't really need even/odd A/B. Looks like I was correct. – tdelaney May 21 '14 at 21:51
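
The comment thread pins down the real constraint: the server only allows a limited number of simultaneous connections, so the goal is really to cap how many uploads run at once rather than to split the list into even and odd halves by hand. A minimal sketch of that idea, assuming a hypothetical post_file() helper and a placeholder folder (the worker count stands in for whatever the server permits):

    import glob
    import multiprocessing

    def post_file(path):
        # hypothetical helper: open the .csv at `path` and POST it to the server
        pass

    if __name__ == "__main__":
        folder = "path/to/csvs/"                  # placeholder directory
        files = glob.glob(folder + "*.csv")
        pool = multiprocessing.Pool(processes=4)  # cap simultaneous connections at 4
        pool.map(post_file, files)                # blocks until every file has been handled
        pool.close()
        pool.join()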

1 Answer

I'm guessing here at your request, because the original question is quite unclear. Since os.listdir doesn't guarantee an ordering, I'm assuming your "two" functions are actually identical and you just need to perform the same process on multiple files simultaneously.

The easiest way to do this, in my experience, is to spin up a Pool, launch a process for each file, and then wait; e.g.:

    import glob
    import multiprocessing

    def process(file):
        pass # do stuff to a file

    p = multiprocessing.Pool()
    for f in glob.glob(folder + "*.csv"):   # `folder` is the directory from the question
        # launch a process for each file (ish).
        # The result will be approximately one process per CPU core available.
        p.apply_async(process, [f])

    p.close()
    p.join() # Wait for all child processes to close.
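
If the return value of process() is also needed (for example, row counts or parsed data), apply_async hands back result objects whose .get() blocks until that worker finishes. A small variation on the same pattern, with the body of process() as a stand-in and a placeholder folder:

    import glob
    import multiprocessing

    def process(file):
        with open(file) as fh:
            return len(fh.readlines())  # stand-in for real per-file work

    if __name__ == "__main__":
        folder = "path/to/csvs/"        # placeholder directory
        p = multiprocessing.Pool()
        pending = [p.apply_async(process, [f]) for f in glob.glob(folder + "*.csv")]
        p.close()
        results = [r.get() for r in pending]  # .get() blocks until each worker finishes
        p.join()
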
Henry Keiter