I have a relatively large dataset: ~1000 files, each with ~1000 rows and ~1000 columns. I am running an algorithm where, on each step, I need to loop through the entire dataset and do matrix multiplies with this data (think algorithms like gradient descent).
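Concretely, the per-file work boils down to something like this (simplified sketch; I'm assuming numpy arrays stored as .npy files here just for illustration, the real format and names differ):

```python
import numpy as np

def process_file(path, params):
    # Load one ~1000 x 1000 chunk of the dataset from disk
    X = np.load(path)      # shape (1000, 1000)
    # The per-step work is essentially a matrix multiply against the
    # current parameter vector (think one partial gradient)
    return X @ params      # shape (1000,)
```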
Currently, I am storing this data in the 1000 files, and on each iteration I loop through the files, opening each one, doing the computation, saving the result, and closing the file. I then use the results and repeat the process for the next iteration of the algorithm. It is slow, roughly 2 seconds per step, and I have to do tens of thousands of iterations, which easily takes hours. I am already using some parallel processing: the files can be opened independently, so I have a multiprocessing.Pool in Python and use Pool.map to do the per-file analysis in parallel.
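The outer loop is roughly this (again a simplified sketch; the update rule and the 1e-3 step size are placeholders for what the real algorithm does with the per-file results):

```python
from functools import partial
from multiprocessing import Pool

import numpy as np

def process_file(path, params):
    # Per-file matrix multiply (same function as in the snippet above)
    return np.load(path) @ params

def run(paths, params, n_iters):
    # Keep one pool alive for the whole run instead of re-creating it each step
    with Pool() as pool:
        for _ in range(n_iters):
            # Fan the per-file multiplies out across worker processes
            partials = pool.map(partial(process_file, params=params), paths)
            # Combine the partial results and update the parameters
            # (placeholder update; the real algorithm does more here)
            params = params - 1e-3 * np.sum(partials, axis=0)
    return params
```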
I am wondering if there might be a smarter way to do this that would speed up the computation. I have an EC2 instance on AWS with roughly 128 cores, which is nearly the largest instance available. Is there a way to link multiple instances together and parallelize across machines? Or maybe there's a better approach altogether.