In a standard setting where a single module runs with multiple threads, we can time the program using real time (a.k.a. wall-clock time) and thread time (the total time spent across all threads used by the module). If the real time is low, there is no problem: the program finished quickly and there is no need to optimize it. If the real time is high, we want to lower it, but we don't know what makes the program slow: the efficiency of the algorithm or the parallelization. The thread time tells us where the time goes. If the thread time is low relative to the real time, the threads spent most of the run waiting, so the parallelization needs to be optimized. If the thread time is high, the CPUs were busy the whole time, so the algorithm needs to be optimized.
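For illustration, a minimal sketch of measuring both quantities (Python chosen only for brevity; the language of the actual program isn't stated), using time.perf_counter() for real time and time.process_time() for the CPU time summed over all threads of the process:

    import time

    def timed_run(work):
        """Time a multithreaded run: real (wall-clock) time vs. total CPU time.

        `work` is a hypothetical callable that spins up the module's threads,
        does its job, and returns only once they have all finished.
        """
        wall_start = time.perf_counter()   # real / wall-clock time
        cpu_start = time.process_time()    # user + sys CPU time over all threads
        work()
        real = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        print(f"real: {real:.2f}s  cpu over all threads: {cpu:.2f}s")
        # High real time, low CPU time  -> threads mostly waited: look at the parallelization.
        # High real time, high CPU time -> CPUs were busy: look at the algorithm.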
Now, this is well known and has already been covered to some extent in What do 'real', 'user' and 'sys' mean in the output of time(1)?
We run our program in a different setting. We have a huge amount of data, so we often need to save and load data from disk because we can't keep it all in memory at once. To avoid I/O as much as possible, we stream one data point at a time through several modules running at the same time. To clarify with an example: we have two modules A and B, and some data D. The data is a collection of data points d1, d2, ... . Our pipeline is then defined as follows (a code sketch of the setup comes after it):
disk -> d1 -> A -> d1' -> B -> d1'' -> disk
disk -> d2 -> A -> d2' -> B -> d2'' -> disk
and so on.
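To make the setup concrete, here is a minimal sketch of such a streamed pipeline in Python (module_a, module_b and run_pipeline are made-up names; each stage runs in its own thread and handles one data point at a time):

    import threading
    import queue

    def module_a(d):
        """Hypothetical stand-in for module A: d -> d'."""
        return d

    def module_b(d):
        """Hypothetical stand-in for module B: d' -> d''."""
        return d

    def stage(work, inbox, outbox):
        """Run one module: pull a data point, transform it, hand it downstream."""
        for d in iter(inbox.get, None):   # None marks the end of the stream
            outbox.put(work(d))
        outbox.put(None)                  # propagate the sentinel

    def run_pipeline(data_points):
        # Tiny queues so only a few data points are in memory at once.
        q_a, q_b, q_out = (queue.Queue(maxsize=1) for _ in range(3))
        threading.Thread(target=stage, args=(module_a, q_a, q_b)).start()
        threading.Thread(target=stage, args=(module_b, q_b, q_out)).start()

        results = []                      # stands in for writing d'' back to disk
        collector = threading.Thread(
            target=lambda: results.extend(iter(q_out.get, None)))
        collector.start()

        for d in data_points:             # stands in for reading d1, d2, ... from disk
            q_a.put(d)
        q_a.put(None)
        collector.join()
        return results

    print(run_pipeline([1, 2, 3]))        # -> [1, 2, 3] with the identity stand-ins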
Now, to add an extra layer: we found that module B was slow, so we parallelized it, and it's super effective ... or it would be, if it weren't for the fact that we can no longer rely on our measurements of real time. Before, we had a timer for each module that was started before processing a given data point and suspended afterwards. Now, we measure the real time of A and B while they run at the same time, so their intervals overlap.
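For reference, the kind of per-module timer we used looks roughly like this (ModuleTimer and timer_a are hypothetical names); once A and B run concurrently, the two timers accumulate overlapping wall-clock intervals, so their totals no longer add up to the real time of the whole pipeline:

    import time

    class ModuleTimer:
        """Accumulating per-module timer: started before a data point, suspended after."""
        def __init__(self):
            self.elapsed = 0.0
            self._start = None

        def start(self):
            self._start = time.perf_counter()                 # wall-clock time

        def suspend(self):
            self.elapsed += time.perf_counter() - self._start
            self._start = None

    # Per data point, inside a stage (timer_a is a per-module instance):
    #   timer_a.start(); d_prime = module_a(d); timer_a.suspend()
    # With A and B running at the same time, timer_a.elapsed + timer_b.elapsed
    # can exceed the real time of the whole pipeline.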
QUESTION
Does there exist a way to measure time for a streamed, parallelized system that makes it possible to reason about where to optimize, and whether to focus on the efficiency of the algorithm or the parallelization?