Parallel IntStream hack

Question

Here's some code.

public void blur(final int x, final int y, final int w, final int h) {
    final Picture p = new Picture(this);
    IntStream.range(x, x + w).parallel().forEach(i
        -> IntStream.range(Y, Y + h).forEach(j
            -> {
                final Pixel pixel = this.getPixel(i, j);
                final java.util.List<Pixel> others
                = Arrays.asList(
                    p.getPixel(i - 1, j),
                    p.getPixel(i, j - 1),
                    p.getPixel(i, j + 1),
                    p.getPixel(i + 1, j),
                    p.getPixel(i - 1, j - 1),
                    p.getPixel(i + 1, j + 1),
                    p.getPixel(i - 1, j + 1),
                    p.getPixel(i + 1, j - 1),
                    pixel
                );
                pixel.setBlue((int) (others.stream()
                    .mapToInt(Pixel::getBlue).average().getAsDouble()));
                pixel.setRed((int) (others.stream()
                    .mapToInt(Pixel::getRed).average().getAsDouble()));
                pixel.setGreen((int) (others.stream()
                    .mapToInt(Pixel::getGreen).average().getAsDouble()));
        })
    );
}

Some languages offer a parallel for-loop for a series of integers. Java doesn't seem to, but I don't feel like multithreading the "correct way" (like fork-join, etc.)

Is this efficient? I have found that this is indeed faster than the standard for (int i ... code. Which of the loops (streams) should I make parallel? Is this good coding practice?

As an additional question, would this be premature optimisation? If it doesn't matter if it takes 2 seconds rather than 1, why go through the trouble of making it run fast at the cost of clarity? — Bartvbl, May 19 '14 at 17:23
@Bartvbl No. It's a matter of 10 seconds instead of 100. Besides, I don't really find it premature _per se_. — Simon Kuang, May 19 '14 at 17:25
wow, yes. That definitely makes it worth it. Never mind me :) — Bartvbl, May 19 '14 at 17:27
I'm inclined to say that you should stick with the regular for-loops over the new stream approach as it only adds clutter. Now I do realise that you are aiming for (parallel) effiency, so I would suggest you to use the fork-join framework for this, because currently it looks quite messy. — skiwi, May 19 '14 at 17:29
This looks exactly like what you want to implement: http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html (At very least a good starting point) — skiwi, May 19 '14 at 17:36
Few days back I tried this and I found that its efficient, only from the perspective of carrying out some calculation independently i.e one calculation doesn't affect the other one in such cases we can use it.If you want further information on streams I suggest you take a look at the video from VJUG for streams work shop. Again it internally uses the fork join at the end of the above mentioned link its stated. — Bilbo Baggins, May 19 '14 at 17:39
@skiwi I see. It looks complicated though. Isn't this much shorter for such a simple task? — Simon Kuang, May 19 '14 at 17:39
Perhaps making tuples of (x, y) coordinates and processing them in parallel would work and make it nicer? I'll try to come up with something. — skiwi, May 19 '14 at 17:41
@skiwi Like `parallelStream()` on a bunch of special containers? That's cool (I just thought it was too much work). Thanks! — Simon Kuang, May 19 '14 at 17:46
Have you seen this other example on SO: http://stackoverflow.com/questions/23489993/nested-java-8-parallel-foreach-loop-perform-poor-is-this-behavior-expected — edharned, May 19 '14 at 18:06
`new Picture(this)`: Is the whole image copied into the new object? Is the blur method usually applied to the whole image, or only to a small part of the image? — nosid, May 19 '14 at 19:36
@nosid The whole image. It copies the image first, or else blur isn't quite correct. — Simon Kuang, May 19 '14 at 20:00

score 2 · Accepted Answer · answered May 19 '14 at 21:40

If performance really matters. you should focus on:

inner loops,
and memory locality.

In particular, the latter depends on the layout of pixels in memory. It can have a significant impact on performance, whether they are aligned row-by-row or column-by-column (e.g. due to false sharing). For this reason, I recommend using explicit parallelisation.

Imagine you have a method that is optimized for sequential execution:

void blurSequential(Picture source, int x, int y, int w, int h);

Then you can easily divide the image into tiles, and execute the sequential method on each tile independently. The following code shows how this can implemented. You have to replace the async and await pseudo instructions with an async execution mechanism, e.g. with ExecutorService and Future.

void blurParallel(Picture source, int x, int y, int w, int h) {
    int processors = Runtime.getRuntime().availableProcessors();
    blurParallel(source, x, y, w, h, processors * 4);
}

void blurParallel(Picture source, int x, int y, int w, int h, int parallelism) {
    if (parallelism <= 1) {
         blurSequential(source, x, y, w, h);
    } else if (w >= THRESHOLD_WIDTH) {
         int m = w / 2;
         async blurParallel(source, x, y, m, h, parallelism / 2);
         blurParallel(source, x + m, y, w - m, h, parallelism / 2);
         await
    } else if (h >= THRESHOLD_HEIGHT) {
         int m = h / 2;
         async blurParallel(source, x, y, w, m, parallelism / 2);
         blurParallel(source, x, y + m, w, h - m, parallelism / 2);
         await
    } else {
         blurSequential(source, x, y, w, h);
    }
}

score 0 · Answer 2 · answered May 19 '14 at 18:05

First of all, I think this code is too cluttered and thus hides what it is actually doing. Hence I propose the following code, with as remainder that streams are not the solution to everything.

private List<Pixel> getNeighbours(final Picture picture, final IntTuple intTuple) {
    List<Pixel> pixels = new ArrayList<>();
    for (int x = intTuple.x - 1; x <= intTuple.x + 1; x++) {
        for (int y = intTuple.y - 1; y <= intTuple.y + 1; y++) {
            pixels.add(picture.getPixel(x, y));
        }
    }
    return pixels;
}

private void averagePixel(final Picture picture, final IntTuple intTuple) {
    double red = 0d;
    double green = 0d;
    double blue = 0d;
    List<Pixel> neighbours = getNeighbours(picture, intTuple);
    for (Pixel pixel : neighbours) {
        red += pixel.getRed();
        green += pixel.getGreen();
        blue += pixel.getBlue();
    }
    Pixel pixel = picture.getPixel(intTuple.x, intTuple.y);
    pixel.setRed((int)(red / neighbours.size()));
    pixel.setGreen((int)(green / neighbours.size()));
    pixel.setBlue((int)(blue / neighbours.size()));
}

IntStream.range(x, x + w)
        .parallel()
        .boxed()
        .flatMap(xVal -> IntStream.range(y, y + h).parallel().mapToObj(yVal -> new IntTuple(xVal, yVal)))
        .forEach(tuple -> averagePixel(picture, tuple));

What this does is:

Obtains a parallel IntSream over the width.
Then boxes it to a Stream<Integer> as there is no flatMapToObj. This is somewhere where I believe you could gain performance with a ForkJoinPool.
Then performs a flat map operation, that maps every x-value to an (x, y) tuple as follows:
1. Obtain a parallel IntStream over the height.
2. Map the IntStream to the IntTuple. Note that here you still have access to the x value at the start of the flat map operation.
Now you have a Stream<IntTuple>.
Finally you apply the averagePixel operation on all tuples.

Special notes for the other methods:

getNeighbours uses a double loop over the x and y coordinates to avoid handcoding it.
averagePixel does not use streams such that it can calculate the red, green and blue values in one go. I don't think you could do it this way with a stream.

I hope this helps, please note though that this code is untested and may contain implementation errors.

score 0 · Answer 3 · answered May 19 '14 at 22:19

I do not know the class Picture.

When using BufferedImage, Image, Raster one may get entire arrays of pixel values, for instance BufferImage.getRGB. I assume that these underlying data is used in Picture.

Calculating an average on the list others is not needed: one may immediately sum & increase a counter. Sketchy:

int counter = 0;
int sum = 0;
if (j > 0) {
    sum += source.getPixel(i, j - 1);
    ++counter;
}
...
int blurred = sum / counter;

The overhead of always a new data structure (int[9] or so), and doing paralel is not worth it.

Parallel IntStream hack

3 Answers3