Code in the thread pool runs much slower than unthreaded

Question

I have code that reads a set of binary files, which essentially consist of a lot of serialized Java objects. I'm trying to parallelize the code by running the reading of the files in a thread pool (Executors.newFixedThreadPool).

What I'm seeing is that when threaded, the reading actually runs slower than in a single thread -- from 1.5 to 10 times slower, depending on the number of threads.

In my test case I'm actually reading the same file (35 MB) from multiple threads, so I'm not bound by I/O in any way. I do not run more threads than CPUs and I do not have any synchronization between the pool's threads, i.e. I'm just processing a bunch of files independently.

Does anyone have an idea what could cause this slow performance when threaded? What should I look for? Or what's the best way to dissect the problem? I already looked for static variables in the classes that could be shared between threads, and I don't see any. Can one of the java.* classes run significantly slower when instantiated in a thread (e.g. the java.util.zip deflate classes, which I'm using)?
Thanks for any hints.

Update: Another interesting hint: with a single thread, the execution time of the function that does the reading is constant to high precision, but with multiple threads I see significant variation in the timings.
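The measurement described in the question can be reproduced with a small harness along these lines. This is only a sketch: the class name `PoolTiming`, the method `timePerTask`, and the placeholder workload are illustrative assumptions, not the asker's code. Each task times itself, so the per-task durations for a pool of 1 thread can be compared with those for a pool of n threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolTiming {
    /**
     * Runs `copies` instances of `work` on a fixed pool of `threads` threads
     * and returns each task's own elapsed time in nanoseconds.
     */
    static long[] timePerTask(Runnable work, int threads, int copies) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < copies; i++) {
                Callable<Long> timed = () -> {
                    long start = System.nanoTime();
                    work.run();
                    return System.nanoTime() - start;
                };
                futures.add(pool.submit(timed));
            }
            long[] nanos = new long[copies];
            for (int i = 0; i < copies; i++) {
                nanos[i] = futures.get(i).get(); // rethrows any task failure
            }
            return nanos;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload (assumption): substitute the real file-reading function.
        Runnable work = () -> {
            long x = 0;
            for (int i = 0; i < 50_000_000; i++) x += i ^ (x >>> 3);
            if (x == 42) System.out.println(x); // keeps the loop from being optimized away
        };
        for (int threads : new int[] {1, 2, 4}) {
            long[] nanos = timePerTask(work, threads, threads);
            for (long t : nanos) {
                System.out.printf("threads=%d task took %.1f ms%n", threads, t / 1e6);
            }
        }
    }
}
```

If the work is truly independent, the per-task time should stay roughly flat as the thread count grows (up to the number of cores); per-task times that grow with the thread count point to a shared resource.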

Are you saying that even with an `Executors.newFixedThreadPool(1)` your code runs slower than if it is called directly -- not in a pool? I find that hard to believe. — Gray, May 15 '12 at 15:33
"I'm actually reading the same file (35mb) from multiple threads, so I'm not bound by I/O in any way" - Well yes you are: you only have one disk so if you read it from multiple threads it is unlikely to go quicker and will most likely be slower... Or maybe I misunderstood what you asked for? — assylias, May 15 '12 at 15:35
No, I'm saying that when I run Executors.newFixedThreadPool(n) and n>=2, the code executed by a thread always runs slower than when I use Executors.newFixedThreadPool(1). — sega_sai, May 15 '12 at 15:36
Just because all thread workers are reading from the same file does not mean for sure it is not IO bound. It might be. It might not be. To be sure, set up your test case so that all thread workers are reading from a file in memory vs. off disk. — kaliatech, May 15 '12 at 15:36
My file has a size of 35meg, and it is in the OS cache anyway, so I don't see how it could possibly be I/O bound. Also I'm seeing that there is no disk activity at all (I'm using linux btw). — sega_sai, May 15 '12 at 15:39
What classes are you using for reading? Old io classes or nio classes? Can you paste some io code? — Drona, May 15 '12 at 15:41
Well, what is the CPU use? If it's 100%, we can assume that you are not I/O bound. If it's 6%, then... — Martin James, May 15 '12 at 15:42
The IO classes I'm using are InputStream, DataInputStream, SequenceInputStream etc, which seems to be java.io (not new classes) — sega_sai, May 15 '12 at 15:44
The CPU use is not 6% but it is less than nthreads times 100%, so my current guess is that there is some locking somewhere going on... — sega_sai, May 15 '12 at 15:45
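One way to test the locking guess from inside the JVM is to ask `ThreadMXBean` how often a thread has blocked while entering a monitor. The sketch below is illustrative only (the class `BlockedCountDemo` and its methods are made-up names): it forces a single monitor collision and reads the waiter's blocked count. In a real run, the same `blockedCount` call could be made for the pool's worker threads while they are still alive.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class BlockedCountDemo {
    /** Number of times `t` has blocked on a monitor, or -1 if the thread is not alive. */
    static long blockedCount(Thread t) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(t.getId());
        return info == null ? -1 : info.getBlockedCount();
    }

    /** Forces one contended monitor entry and returns the waiter's blocked count. */
    static long demo() throws InterruptedException {
        final Object lock = new Object();
        final CountDownLatch done = new CountDownLatch(1);
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                // Entered only after the main thread has released the lock.
            }
            try {
                done.await(); // stay alive so the thread can still be inspected
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        synchronized (lock) {
            waiter.start();
            // Hold the lock until the waiter is really stuck on it.
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);
            }
        }
        long count = blockedCount(waiter);
        done.countDown();
        waiter.join();
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("blocked count of waiter: " + demo());
    }
}
```

A blocked count that climbs steadily on the worker threads during the read would support the locking hypothesis; a thread dump (for example with `jstack`) would then show which monitor they are waiting for.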

score 2 · Answer 1 · answered May 15 '12 at 15:44

Sounds to me like you are expecting a java.util.zip deflate read of 35 MB to run faster when you add multiple threads doing the same job. It won't. In fact, although you may not be IO bound, you still incur kernel overhead with each thread that you add (buffer copies, etc.). Even if you are reading entirely out of kernel buffer space, you incur CPU and processing overhead.

That said, I am surprised that you see a 1.5x to 10x slowdown. If each of your processing threads is then writing output, then obviously that won't be cached.

However I suspect that you may be incurring memory contention. If you are handling a Java serialized object stream, you need to watch your memory consumption unless you are resetting it often. Serialization keeps a lot of references around to objects so that large contiguous streams can generate a tremendous amount of GC bandwidth.
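The "resetting" mentioned above refers to `ObjectOutputStream.reset()`, which is called on the writing side; the reader drops its table of previously seen objects when it encounters the reset marker in the stream. A minimal sketch of the effect follows (the class name `ResetDemo` and its method are illustrative assumptions): without a reset, the second write of the same object becomes a back-reference, so the reader must keep the first instance reachable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class ResetDemo {
    /**
     * Writes the same object twice, optionally calling reset() in between,
     * and reports whether the reader gets back one shared instance.
     */
    static boolean sameInstanceAfterRoundTrip(boolean resetBetween)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            int[] payload = {1, 2, 3};
            out.writeObject(payload);
            if (resetBetween) {
                out.reset(); // forget everything written so far
            }
            out.writeObject(payload);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Object first = in.readObject();
            Object second = in.readObject();
            return first == second;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("no reset, shared instance:   " + sameInstanceAfterRoundTrip(false));
        System.out.println("with reset, shared instance: " + sameInstanceAfterRoundTrip(true));
    }
}
```

Without the reset the two reads return the same instance; with it they return separate copies, and the reader no longer has to hold on to earlier objects. If the files were written without resets, this cannot be changed from the reading side.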

I'd connect to your program using jconsole and watch the memory tab closely. As the survivor and old-gen spaces fill you will see non-linear CPU implications.

I'm not expecting java.util.zip deflate to run *faster* when running multiple threads. I'm expecting it to run exactly as fast as in the single thread. Thanks for the memory/jconsole advice. I'll check it. — sega_sai, May 15 '12 at 16:00
It would be interesting to do _just_ the IO in each of your threads -- comment out the zip processing. Next, just do the zip processing but without the serialization. See if you can see where the inflection point is. — Gray, May 15 '12 at 16:04
Thanks, I'll try that, although the I/O reading pattern depends on what's actually read from the file, so it may be hard/impossible to completely separate I/O from serialization. — sega_sai, May 15 '12 at 16:08
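The stage-by-stage isolation suggested in the comments above could look roughly like the following sketch. Everything here is an assumption for illustration: the class `StageTiming`, its helpers, and the zero-filled stand-in data are not from the original code. It times a plain read of compressed bytes and then the same bytes read through `InflaterInputStream`, both from memory, so disk access is out of the picture.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class StageTiming {
    /** Reads the stream to the end and returns the number of bytes seen. */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /** Compresses `raw` in memory (zlib format), only to produce test input. */
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = new byte[8 << 20]; // stand-in data (assumption); use real file contents
        byte[] packed = deflate(raw);

        long t0 = System.nanoTime();
        long plain = drain(new ByteArrayInputStream(packed));          // stage 1: read only
        long t1 = System.nanoTime();
        long inflated = drain(new InflaterInputStream(
                new ByteArrayInputStream(packed)));                    // stage 2: read + inflate
        long t2 = System.nanoTime();

        System.out.printf("read only:      %d bytes in %.2f ms%n", plain, (t1 - t0) / 1e6);
        System.out.printf("read + inflate: %d bytes in %.2f ms%n", inflated, (t2 - t1) / 1e6);
    }
}
```

Running each stage inside the thread pool, first alone and then with n threads, should show at which stage the per-task time starts to grow.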

score 0 · Answer 2 · edited May 23 '17 at 12:20

Just because all thread workers are reading from the same file does not mean for sure it is not IO bound. It might be. It might not be. To be sure, set up your test case so that all thread workers are reading from a file in memory vs. off disk.
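A test along those lines might load the file into a byte array once and give every worker its own stream over that array. The sketch below is a hedged illustration (the class `InMemoryRead`, the method `readInParallel`, and taking the file path from `args[0]` are assumptions); the real per-file parsing would replace the simple byte-counting loop.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InMemoryRead {
    /**
     * Starts `threads` tasks; each one reads the shared, read-only array through
     * its own stream and returns the number of bytes it consumed.
     */
    static List<Long> readInParallel(byte[] data, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                tasks.add(() -> {
                    try (DataInputStream in =
                             new DataInputStream(new ByteArrayInputStream(data))) {
                        byte[] buf = new byte[8192];
                        long total = 0;
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            total += n; // real parsing would go here
                        }
                        return total;
                    }
                });
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0])); // file path supplied by the caller
        long start = System.nanoTime();
        List<Long> counts = readInParallel(data, 4);
        System.out.printf("%d tasks read %s bytes in %.1f ms%n",
                counts.size(), counts, (System.nanoTime() - start) / 1e6);
    }
}
```

If the slowdown persists with the data held entirely in memory, disk I/O and file locking can be ruled out.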

You mentioned above that you believe the OS has cached the file, but do you know for sure that the file is being opened in read-only/shared mode? If not, then the OS could still be locking the file to ensure only one thread has access at a time.

kaliatech · answered May 15 '12 at 15:45

Thanks for the suggestion, but I don't think file locking is what's happening: before changing my test to read the same file multiple times, I had different files and the picture was the same. (And that was tested on a very powerful machine with tons of RAM and very good RAID.) So it is simply impossible to believe that reading a 35 MB file takes ~60 seconds. – sega_sai May 15 '12 at 15:51

sega_sai · Accepted Answer · 2012-05-20T13:53:10.850

The problem was caused by the java.util.zip.Inflater class, which actually has a lot of synchronized methods (because several of them use native code), so when multiple threads are running, the synchronized methods compete with each other and make the code very close to sequential.

The solution was to replace the java.util.zip classes with the pure-Java version from GNU Classpath (e.g. from here: http://git.savannah.gnu.org/cgit/classpath.git/tree/java/util/zip).
