
I am processing a number of 4K images by calculating a parameter on small (64×64 pixel) patches of each image. The task is currently carried out sequentially, one patch at a time. A snippet of my code is copied below to show the idea.

for (int i = 0; i < imageW / pSize; i++) {
  for (int j = 0; j < imageH / pSize; j++) {
    thisPatch = MatrixUtil.getSubMatrixAsMatrix(image, i * pSize, j * pSize, pSize);
    results[i][j] = computeParamForPatch(thisPatch);
  }
}

I now need to parallelize this to save some time. As you can see, the processing of each patch is completely independent of all the others. To keep each result associated with its patch, I would either have to remember the location of each patch in a Map (something like Map<Point, double[][]>) or use forEachOrdered(), and I am not sure the map approach can actually be parallelized. So this is my question: apart from using forEachOrdered(), which hurts performance, is there any other way to process an image in parallel?
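For what it's worth, a keyed result map can be filled from a parallel stream. The following is only a rough sketch, not code from my project; it assumes a java.awt.Point key, the same image, pSize and MatrixUtil.getSubMatrixAsMatrix as above, and that computeParamForPatch returns a double:

// Imports assumed at the top of the file:
// import java.awt.Point;
// import java.util.concurrent.ConcurrentHashMap;
// import java.util.concurrent.ConcurrentMap;
// import java.util.stream.IntStream;

ConcurrentMap<Point, Double> resultMap = new ConcurrentHashMap<>();
int patchesX = imageW / pSize;
int patchesY = imageH / pSize;

IntStream.range(0, patchesX * patchesY).parallel().forEach(i -> {
    int px = i % patchesX;   // patch column
    int py = i / patchesX;   // patch row
    double value = computeParamForPatch(
            MatrixUtil.getSubMatrixAsMatrix(image, px * pSize, py * pSize, pSize));
    // ConcurrentHashMap handles concurrent puts from the parallel stream safely.
    resultMap.put(new Point(px, py), value);
});

In the end, writing straight into a pre-sized 2D array (as in the update below) avoids the map entirely, since every patch writes to its own cell.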


One solution: I tried the following code (suggested by @DHa), which gives a significant improvement:

int outputW = imageW / pSize;
int outputH = imageH / pSize;

IntStream.range(0, outputW * outputH).parallel().forEach(i -> {
  int x = i % outputW;
  int y = i / outputW;
  tDirectionalities[x][y] = computeDirectionalityForPatch(
          MatrixUtil.getSubMatrixAsMatrix(image, x * pSize, y * pSize, pSize));
});

Results:

  • Sequential: 15754 ms
  • Parallel: 5899 ms
Azim
  • ExecutorService (see the sketch after these comments) – DimXenon Apr 29 '18 at 15:42
    Why parallelize *this*? It seems like it'd be a lot easier to process the *images* in parallel instead of trying to process each image in parallel while still processing the images sequentially. – Andrew Henle Apr 29 '18 at 15:43
  • @AndrewHenle For now, that would add a level of difficulty to the project that we want to avoid. What I explained here is not the entire project. – Azim Apr 29 '18 at 15:47
    @AndrewHenle because then IO would be even more of a bottleneck. – Mad Physicist Apr 29 '18 at 16:28
  • @MadPhysicist *because then IO would be even more of a bottleneck.* How can that possibly be known if it hasn't been tried? – Andrew Henle Apr 29 '18 at 16:42
    A 4K image is heavy on memory, so processing them one at a time may well be beneficial. – DHa Apr 29 '18 at 16:59
    @DHa how are some 20 MB "heavy on memory"? – Turing85 Apr 29 '18 at 17:01
  • @Turing85 Depends on the hardware; we don't have that constraint given here. I regularly work in environments with 128MB available in total. – DHa Apr 29 '18 at 17:05
  • @DHa Do you do image processing with Java in those environments? – Mad Physicist Apr 29 '18 at 19:15
  • @Mad Physicist Yes :) Turn the question around if you prefer: why waste memory when that gives no benefit? I happily 'waste' loads of memory if that balances well against an increase in performance. Here I can see no such benefit; it is more likely to worsen performance. You could also see it from a UI perspective: would you rather have results after 5/10/15 s, or all three after 15 s? – DHa Apr 30 '18 at 04:39
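For completeness, the ExecutorService approach that DimXenon alludes to above would look roughly like the sketch below. This is not code from the question or the answer; it assumes the same image, pSize, MatrixUtil.getSubMatrixAsMatrix and computeParamForPatch as the question, and that computeParamForPatch returns a double:

// Imports assumed at the top of the file:
// import java.util.ArrayList;
// import java.util.List;
// import java.util.concurrent.Callable;
// import java.util.concurrent.ExecutorService;
// import java.util.concurrent.Executors;

int patchesX = imageW / pSize;
int patchesY = imageH / pSize;
double[][] results = new double[patchesX][patchesY];

ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Callable<Void>> tasks = new ArrayList<>();
for (int x = 0; x < patchesX; x++) {
    for (int y = 0; y < patchesY; y++) {
        final int px = x, py = y;       // effectively final copies for the lambda
        tasks.add(() -> {
            // Each task writes to its own cell, so no synchronization is needed on results.
            results[px][py] = computeParamForPatch(
                    MatrixUtil.getSubMatrixAsMatrix(image, px * pSize, py * pSize, pSize));
            return null;
        });
    }
}
try {
    pool.invokeAll(tasks);              // blocks until every patch has been processed
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
} finally {
    pool.shutdown();
}

Whether this beats a parallel stream is largely a matter of taste here; both run one task per patch on a fixed pool of worker threads.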

1 Answer


This solution uses a parallel stream.

See also How many threads are spawned in parallelStream in Java 8 for how to control the number of threads that work on the stream simultaneously.

    int patchWidth = (int) Math.ceil((double) imageW / pSize);
    int patchHeight = (int) Math.ceil((double) imageH / pSize);

    IntStream.range(0, patchWidth * patchHeight).parallel().forEach(i -> {
        int x = i % patchWidth;
        int y = i / patchWidth;

        // Each patch is extracted and processed independently; the result cells are disjoint,
        // so writing into the shared array from multiple threads is safe here.
        results[x][y] = computeParamForPatch(
                MatrixUtil.getSubMatrixAsMatrix(image, x * pSize, y * pSize, pSize));
    });
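To control how many worker threads actually process the stream (the subject of the linked question), one widely used option is to submit the parallel stream to a dedicated ForkJoinPool. The following is only a sketch, not part of the answer above; it reuses the variables from the code above, and it relies on behaviour of the Fork/Join framework rather than a documented guarantee:

    // Imports assumed at the top of the file:
    // import java.util.concurrent.ExecutionException;
    // import java.util.concurrent.ForkJoinPool;
    // import java.util.stream.IntStream;

    int threads = Runtime.getRuntime().availableProcessors(); // pick the parallelism you want
    ForkJoinPool pool = new ForkJoinPool(threads);
    try {
        // The stream's tasks run in the pool that submits them, so this bounds the worker count.
        pool.submit(() ->
                IntStream.range(0, patchWidth * patchHeight).parallel().forEach(i -> {
                    int x = i % patchWidth;
                    int y = i / patchWidth;
                    results[x][y] = computeParamForPatch(
                            MatrixUtil.getSubMatrixAsMatrix(image, x * pSize, y * pSize, pSize));
                })
        ).get();                        // wait for all patches to finish
    } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
    } finally {
        pool.shutdown();
    }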
DHa
  • I made a silly mistake in my test (this is why I removed my comment here). In fact, the parallel version you proposed indeed makes it significantly faster. See my update. – Azim Apr 29 '18 at 22:15
    @Azim From the timing results you've given, it looks like you are now employing 3 cores on it. parallel() will default to using all cores minus one, so if you use the techniques in the link to increase that to all cores, you could improve the result a little further, at the cost of possibly starving other threads of CPU time. Perhaps a better approach is to use the remaining core for the IO operations that handle input/output to this function. – DHa Apr 30 '18 at 04:44