
I have a relatively large dataset: ~1000 files, each with ~1000 rows and ~1000 columns. I am running an algorithm where on each step I need to loop through the entire dataset and do matrix multiplications with this data (think algorithms like gradient descent).

Currently I am storing this data in the 1000 files, and on each iteration I loop through the files: opening each one, doing the computation, saving the result, and closing the file. I then use the results and repeat the process for the next iteration of the algorithm. It is slow, roughly 2 seconds per step, and I have to do tens of thousands of iterations, which easily takes hours. I am using some parallel processing: the files can be opened independently, so I have a multiprocessing.Pool in Python and use Pool.map to do the per-file analysis in parallel.
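For reference, here is a stripped-down sketch of my current approach (the filenames and the per-file computation are placeholders, not my real code):

```python
import os
import tempfile
import numpy as np
from multiprocessing import Pool

def process_file(path):
    """Open one matrix file, do a multiply, save the result, return a summary."""
    data = np.loadtxt(path, delimiter=",")       # assumes a CSV-like text layout
    result = data @ data.T                       # stand-in for the real multiply
    np.savetxt(path + ".out", result, delimiter=",")
    return result.sum()                          # stand-in per-file result

if __name__ == "__main__":
    # Tiny demo: a few small files instead of 1000 large ones.
    workdir = tempfile.mkdtemp()
    paths = []
    for i in range(4):
        p = os.path.join(workdir, f"data_{i}.csv")
        np.savetxt(p, np.full((3, 3), float(i)), delimiter=",")
        paths.append(p)
    with Pool(2) as pool:
        partials = pool.map(process_file, paths)  # one task per file
    print(partials)
```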

I am wondering if there might be a smarter way to do this that can speed up the computation. I have an EC2 instance on AWS with roughly 128 cores, which is nearly the largest instance available. Is there a way to link multiple instances together and parallelize across machines? Or maybe there's a better approach entirely.

Drew Brady
    The first thing that comes to mind would be to consider using lambda functions for independent processing of the 1000 files. Either one lambda per file or in some small batches. You could run 1000 lambdas concurrently, because 1000 is a default limit per region. – Marcin May 22 '20 at 03:16
  • You could consider moving to Spark running on Amazon EMR. It is designed for distributed computing. (More complex to learn, but very powerful.) – John Rotenstein May 22 '20 at 03:58
  • @DrewBrady Could you put some quantitative details about the data and the hardware resources? 1000 pieces of files of what data size : 1000 x 1000 x ( ?_items_? ) [B] - be specific, plus add an as-is state of the duration of the matrix multiplication ( in [us] ) per one shot. Next - what are your hardware RAM-footprint limits ( in [GB] ) and what is the cluster-abstracted NUMA architecture (best post all the hwloc / lstopo details as shown in https://stackoverflow.com/a/60536408/3666197). Without such details, answers to your question will remain generic, without directing you to the best way. – user3666197 May 22 '20 at 07:37
  • You need to start by profiling your application to find out where the bottlenecks are. There are lots of ways to distribute work, but they all add overhead (potentially more than 2 seconds per step). And if it turns out that your real bottleneck is opening and reading the files on each step, none of them will help you. – Parsifal May 22 '20 at 12:32
  • Also, are you processing your data in pure Python, or using a library like NumPy? If the former, consider switching to the latter. – Parsifal May 22 '20 at 12:36

1 Answer


You don't really give enough information for an answer, but there are three things that could be contributing to your (lack of) performance:

  • IO time, from opening each file and reading its contents on each iteration.
  • Python interpretation time (and more than that, the time needed to manage 1,000,000,000 values as objects).
  • The GIL.

To understand which of these is causing your performance problems, you need to measure where the time is taken. This can be as simple as putting some time-tracking variables into your processing function:

import time

def processFile(filename):
    t0 = time.time()
    with open(filename) as f:
        data = f.read()          # read your data
    t1 = time.time()
    result = data                # do something with the data
    t2 = time.time()
    with open(filename, "w") as f:
        f.write(result)          # write the updated data
    t3 = time.time()
    print(f"{t1-t0},{t2-t1},{t3-t2}")

If t1-t0 plus t3-t2 accounts for most of your time, then you can probably solve your problem by reading each file once at the start of your program and keeping the data in memory. If you have to persist intermediate results to disk on every step, this won't help.
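A minimal restructuring along those lines might look like this (sketch only; `paths` and the update step are placeholders for your real data and computation):

```python
import numpy as np

def run(paths, n_iters):
    # Read every file exactly once, up front...
    matrices = [np.loadtxt(p, delimiter=",") for p in paths]
    state = np.zeros(matrices[0].shape[1])
    for _ in range(n_iters):
        # ...then iterate purely in memory, with no per-step file IO.
        for m in matrices:
            state = state + m.sum(axis=0)   # stand-in for the real update
    return state
```

At 1000 files of 1000x1000 float64 values, that's about 8 GB, which should fit comfortably in RAM on a 128-core instance.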

If t2-t1 is where your problem lies, then you need to move your calculations out of Python. One standard approach is to use NumPy, which performs the calculations in C (and also handles loading and writing the file in a memory-efficient manner).
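As a sketch of that approach (assuming your current files are text; adjust the loader to match your actual format), NumPy also lets you store binary `.npy` files, which load far faster than parsed text:

```python
import numpy as np

def convert(txt_path, npy_path):
    # One-time conversion: parsing text is slow, so store binary .npy instead.
    np.save(npy_path, np.loadtxt(txt_path, delimiter=","))

def step(npy_path, vector):
    # Per-iteration work becomes a single C-speed matrix multiply.
    matrix = np.load(npy_path)       # near-memcpy speed for .npy files
    return matrix @ vector           # stand-in for the real computation
```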

You can figure out whether it's the bytecode/memory management or the GIL by comparing the time for a single-threaded run versus a multi-threaded one. If total time scales with the number of threads, the GIL is serializing your work.
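A quick way to run that comparison (a generic CPU-bound sketch, not your actual workload):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(n):
    # Pure-Python loop: holds the GIL the whole time it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(n_threads, n=500_000):
    # Run one chunk of work per thread and time the whole batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(n_threads) as ex:
        list(ex.map(cpu_work, [n] * n_threads))
    return time.perf_counter() - start

# If t4 is roughly 4 * t1, the threads are being serialized by the GIL.
t1, t4 = timed(1), timed(4)
print(t1, t4)
```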

Alternatively, you can look at other languages that provide a more efficient way to manage memory and better isolation of threads: Java / C# / Go / C / C++ ... whatever you're comfortable with.

Putting a lot of effort into distributing the work across multiple computers, however, is the last thing that I would try. There's always overhead to distribute work and coordinate results, and in my experience it only provides returns when you're talking terabytes of data, not gigabytes.

Parsifal