I have a huge byte array that needs to be processed. In theory, it should be possible to slice the work into evenly sized pieces and assign them to different threads to increase performance on a multi-core machine.

I allocated a ByteBuffer for each thread and had each one process its own part of the data. The final performance is slower than with a single thread, even though I have 8 logical processors. It is also very inconsistent: sometimes the same input takes twice as long to process, or even longer. Why is that? The data is loaded into memory first, so no further I/O operations are performed.

I allocate my ByteBuffers as MappedByteBuffers because they're faster than ByteBuffer.wrap():

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public ByteBuffer getByteBuffer() throws IOException
{
    File binaryFile = new File("...");
    FileChannel binaryFileChannel = new RandomAccessFile(binaryFile, "r").getChannel();

    // Map the whole file read-only; the mapping remains valid even after the channel is closed
    return binaryFileChannel.map(FileChannel.MapMode.READ_ONLY, 0, binaryFileChannel.size());
}

I do my concurrent processing using Executors:

int threadsCount = Runtime.getRuntime().availableProcessors();
ExecutorService executorService = Executors.newFixedThreadPool(threadsCount);
ExecutorCompletionService<String> completionService = new ExecutorCompletionService<>(executorService);

for (ByteBufferRange byteBufferRange : byteBufferRanges)
{
    Callable<String> task = () ->
    {
        performTask(byteBufferRange);

        return null;
    };

    completionService.submit(task);
}

// Wait for all tasks to finish
for (ByteBufferRange ignored : byteBufferRanges)
{
    completionService.take().get();
}

executorService.shutdown();

The concurrent performTask() tasks use their own ByteBuffer instances to read memory from the buffer, do calculations and so on. They do not synchronize with, write to or otherwise influence each other. Any ideas what is going wrong, or is this not a good case for parallelization?

The same problem exists with ByteBuffer.wrap() and MappedByteBuffer alike.

BullyWiiPlaza
  • How large would you say the array is? – Logan Jun 03 '16 at 20:18
  • Mapped buffers are not really files loaded into memory. The OS dynamically maps chunks (pages) of the file contents into memory when you read it and swaps the data out for other data once you read in a different place. That means you use very little actual memory while it can appear as if you had terabytes in memory. But it also means that jumping around can require re-reading from disk. – zapl Jun 03 '16 at 20:20
  • @LoganKulinski: A few hundred MB – BullyWiiPlaza Jun 03 '16 at 20:21
  • This isn’t really a question about parallelizing ByteBuffers. It’s a question about parallelizing the reading of a single file. I expect that adjacent sections of file data are more likely to be next to each other on the media, and therefore slightly faster to access sequentially than seeking to various places all over the file. – VGR Jun 03 '16 at 20:34
  • There's no particular reason why multithreading it should make it faster. The disk isn't multi-threaded. – user207421 Jun 04 '16 at 00:29
  • @EJP disk access is not CPU bound, and disk accesses are queued and pipelined, so multithreading can keep that hardware pipeline busier and therefore improve overall disk throughput. – weston Jun 16 '16 at 15:36

1 Answer

As @EJP mentioned, the disk isn't really multi-threaded, though an SSD may help. The point of mapping the buffer is that you don't have to manage the memory yourself: let the OS do it, since its virtual memory manager and file-system cache are going to be faster than moving the data into Java's heap, and probably faster than any memory-management code you write.
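
If you do stay with the mapped buffer, note that a ByteBuffer's position and limit are not thread-safe, so each worker should read through its own view of the single shared mapping, e.g. via duplicate() and slice(). A minimal sketch (BufferViews and viewOf are placeholder names, and it assumes the whole region fits in one mapping, since FileChannel.map() is limited to 2 GB anyway):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

final class BufferViews
{
    // Each worker gets an independent view of the one shared mapping:
    // the same OS pages, no heap copy, but a private position/limit cursor
    static ByteBuffer viewOf(MappedByteBuffer shared, int offset, int length)
    {
        ByteBuffer view = shared.duplicate();
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }
}

duplicate() is cheap: it copies only the buffer's bookkeeping, not the mapped bytes.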

If the processing really can be parallelized, you will probably be better off having a single thread read the entire file, break it into chunks (possibly in some intermediate data format) and have your executors work on those chunks. The file-reading thread can run concurrently with the other threads, so you don't need to read the whole file before processing starts.

You may want to try setting the number of executors to cores - 1 so you don't starve the file-reading thread. That gives the OS a chance to keep that thread running on a single core without context switching, so you get good I/O performance while using the other cores for CPU-intensive work.
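
Putting those two suggestions together, a rough sketch could look like the following (ChunkedFileProcessor, CHUNK_SIZE and processChunk() are placeholder names, and the path stands in for the question's elided one):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedFileProcessor
{
    private static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MB per work unit

    public static void main(String[] args) throws Exception
    {
        Path file = Paths.get("..."); // stands in for the question's elided path

        // Leave one core free for the reading thread (here, the main thread)
        int workers = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<?>> pending = new ArrayList<>();

        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ))
        {
            ByteBuffer chunk = ByteBuffer.allocate(CHUNK_SIZE);

            // A single sequential reader keeps the OS read-ahead effective
            while (channel.read(chunk) != -1)
            {
                chunk.flip();
                byte[] work = new byte[chunk.remaining()];
                chunk.get(work); // copy out so the reader can reuse its buffer
                chunk.clear();

                pending.add(pool.submit(() -> processChunk(work)));
            }
        }

        for (Future<?> result : pending)
        {
            result.get(); // re-throws any exception from a worker
        }

        pool.shutdown();
    }

    private static void processChunk(byte[] chunk)
    {
        // CPU-bound work on one chunk, analogous to performTask() in the question
    }
}

One caveat: fixed-size chunks may cut across logical record boundaries; if the processing needs whole records, the reader has to split on record boundaries instead.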

FYI, this is what Apache Spark is built for. You may want to look at it if you need to work with larger files, or to process faster than a single system can manage.

AngerClown