I am in the process of migrating from MATLAB to Python, mainly because of the vast number of interesting Machine Learning packages available in Python. But one of the issues that has been a source of confusion for me is parallel processing. In particular, I want to read thousands of text files from disk in a for loop, and I want to do it in parallel. In MATLAB, using parfor instead of for does the trick, but so far I haven't been able to figure out how to do this in Python.
Here is an example of what I want to do. I want to read N text files, reshape each one into an N1xN2 array, and save them all into an NxN1xN2 numpy array, which is what I return from the function. Assuming the file names are file_0000.dat, file_0001.dat, etc., the code I would like to parallelise is as follows:
import numpy as np

N = 10000
N1 = 200
N2 = 100

result = np.empty([N, N1, N2])
for counter in range(N):
    # read one file, reshape it to N1 x N2 and store it in the 3D array
    t_str = "%.4d" % counter
    filename = 'file_' + t_str + '.dat'
    temp_array = np.loadtxt(filename)
    temp_array.shape = [N1, N2]
    result[counter, :, :] = temp_array
I run the code on a cluster, so I have many processors available for the job. Hence, any comment on which parallelisation method is most suitable for my task (if there is more than one) is most welcome.
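For what it's worth, here is a rough sketch of the kind of thing I have been imagining, using multiprocessing.Pool to read the files in worker processes and then assembling the results into one array. I am not sure whether this is the right tool, especially on a cluster, which is exactly part of my question:

import numpy as np
from multiprocessing import Pool

N = 10000
N1 = 200
N2 = 100

def load_one(counter):
    # read a single file and reshape it to N1 x N2
    filename = 'file_%.4d.dat' % counter
    return np.loadtxt(filename).reshape(N1, N2)

if __name__ == '__main__':
    # pool.map keeps the results in the same order as the inputs,
    # so arrays[counter] corresponds to file_<counter>.dat
    with Pool() as pool:
        arrays = pool.map(load_one, range(N))
    result = np.stack(arrays)  # shape (N, N1, N2)

Whether something like this, or another approach entirely, is better suited to a multi-processor cluster is what I am unsure about.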
NOTE: I am aware of this post, but in that post there are only out1, out2 and out3 variables to worry about, and they are used explicitly as arguments of a function to be parallelised. But here, I have many 2D arrays that should be read from files and saved into a 3D array. So, the answer to that question is not general enough for my case (or that is how I understood it).