
I have a Java program for doing a set of scientific calculations across multiple processors by breaking it into pieces and running each piece in a different thread. The problem is trivially partitionable so there's no contention or communication between the threads. The only common data they access are some shared static caches that don't need to have their access synchronized, and some data files on the hard drive. The threads are also continuously writing to the disk, but to separate files.

My problem is that sometimes when I run the program I get very good speed, and sometimes when I run the exact same thing it runs very slowly. If I see it running slowly and ctrl-C and restart it, it will usually start running fast again. It seems to set itself into either slow mode or fast mode early on in the run and never switches between modes.

I have hooked it up to jconsole and it doesn't seem to be a memory problem. When I have caught it running slowly, I've tried connecting a profiler to it but the profiler won't connect. I've tried running with -Xprof but the dumps between a slow run and fast run don't seem to be much different. I have tried using different garbage collectors and different sizings of the various parts of the memory space, also.

My machine is a Mac Pro with a striped RAID partition. The CPU usage never drops off, whether it's running slowly or quickly. If threads were spending too much time blocking on reads from the disk, I would expect CPU usage to drop, so I don't think it could be a disk read problem.

My question is, what types of problems with my code could cause this? Or could this be an OS problem? I haven't been able to duplicate it on a Windows machine, but I don't have a Windows machine with a similar RAID setup.

  • try running with "java -server", maybe it randomly chooses not to use JIT? And how can you make sure that different threads run on different processors/cores? – Denis Tulskiy Oct 30 '09 at 20:33
  • @Piligrim, if he's using Mac OS X Snow Leopard which comes with 64-bit Java 6 by default, he is already using the server version. And it would be really strange if the JVM would randomly choose not to use the JIT - that's not a realistic scenario... – Jesper Oct 30 '09 at 22:15
  • Yeah, I have been using -d64, which apparently supersedes -server. – javajustice Oct 30 '09 at 22:18
  • I would try and proceed with the profiler option. Once connected it should identify the problem within minutes of testing. – Pool Oct 30 '09 at 23:24
  • I was able to duplicate the slowdown and connect visualvm. It was running slowly when I connected, but shortly after connecting with the profiler, it sped back up! It's never done that before. I think it must somehow be getting into a degenerate case with the garbage collector; connecting the profiler knocked it out of the cycle. Anyone have any idea what this could be? I'm using min and max heap sizes of 16 gigs. – javajustice Oct 31 '09 at 00:26

4 Answers

You might have a thread that has gone into an endless loop.

Try connecting with VisualVM and use the Thread monitor.

https://visualvm.dev.java.net

You may have to connect before the problem occurs.
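
If attaching a tool late is a problem, thread states can also be sampled from inside the process. The following is only a minimal sketch using the standard `java.lang.management` API; the class name `ThreadStateSnapshot` and the method `countByState` are made up for illustration and are not part of the asker's program.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSnapshot {

    /** Counts live threads by state (RUNNABLE, BLOCKED, WAITING, ...). */
    public static Map<Thread.State, Integer> countByState() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical usage: print one snapshot; a real run might log this periodically.
        System.out.println(countByState());
    }
}
```

Logging such a snapshot every few seconds from a daemon thread would show whether the worker threads are ever BLOCKED or WAITING, without having to attach after the fact.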

Fedearne
  • I'm pretty sure it's not going into an infinite loop because even when the program runs slowly, it still does finish and give the correct output. – javajustice Oct 30 '09 at 22:19
  • But thank you I will try visual vm and see if it shows anything. – javajustice Oct 30 '09 at 22:22
  • I have tried looking at the threads in visualvm, and none of them are blocking. It says they are all running fine. If I do cpu profiling the results are odd: it only updates sporadically and gives nonsensical results whether the program is running fast or slow. Running the cpu profiler does always, without fail, knock the program out of "slow mode" though. – javajustice Nov 02 '09 at 20:50
  • Just to be clear, loading it in visualvm shows all of the computation threads as 100% "green" with no time spent sleeping, waiting, or in monitor contention. Also, the garbage collector usage is ~0%. – javajustice Nov 02 '09 at 21:26

I second that you should be doing it with a profiler, looking at the threads view: how many threads there are, what states they are in, etc. It might be an odd race condition happening every now and then. It could also be the case that instrumenting the classes with profiler hooks (which causes slowdown) sorts the race condition out, and you will see no slowdown with the profiler attached :/

Please have a look at this post, or rather its answer, which mentions a cache contention problem.

Are you spawning the same number of threads each time? Is that number less than or equal to the number of threads available on your platform? That number can be checked or estimated with fair accuracy.
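
As a rough sketch of checking that number at startup, the standard `Runtime.availableProcessors()` call reports the logical processors visible to the JVM. The class `WorkerCount` and the helper `workerCount` are hypothetical names used only for this example, and reserving one processor is an assumption, not a recommendation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerCount {

    /** Leaves `reserved` logical processors free, but always returns at least 1. */
    public static int workerCount(int logicalProcessors, int reserved) {
        return Math.max(1, logicalProcessors - reserved);
    }

    public static void main(String[] args) {
        // Number of logical processors the JVM can see right now.
        int logical = Runtime.getRuntime().availableProcessors();
        int workers = workerCount(logical, 1);
        System.out.println(logical + " logical processors, " + workers + " worker threads");

        // A fixed pool keeps the thread count identical from run to run.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        // One task per chunk of the problem would be submitted here.
        pool.shutdown();
    }
}
```

Printing both numbers at the start of every run makes it easy to confirm that slow and fast runs use the same thread count.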

Please post any findings!

diginoise
  • The program takes as input an n-dimensional parameter space, and divides it into a constant given number of chunks one for each thread. In this case I'm using 15 chunks since I have 16 logical processors. The threads are almost completely independent. They read from the same set of data files but each with their own channel, and write out to separate data files (one for each point in the parameter space). The only shared memory is some static arrays of constants that begin uninitialized. When one of the threads tries to look up the constant it first checks if it's been ... – javajustice Nov 02 '09 at 20:36
  • ... calculated and if not it calculates it and puts it in the array. So here multiple threads would be accessing the arrays simultaneously, but the access isn't synchronized and all modifications to the arrays are atomic. I'm essentially running what could be run in 15 separate processes in one process for convenience sake. – javajustice Nov 02 '09 at 20:39
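
For reference, a lazily filled table of constants like the one described in these comments could be sketched as below. This is not the asker's code: the class `LazyConstants` and its `compute` function are hypothetical, and the sketch assumes the computation is deterministic, so two threads racing on the same slot produce the same value. It uses `AtomicReferenceArray` so that each slot is read and written with volatile semantics; whether that has any bearing on the slowdown is unknown.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.IntToDoubleFunction;

public class LazyConstants {
    // A null slot means "not computed yet".
    private final AtomicReferenceArray<Double> slots;
    private final IntToDoubleFunction compute;

    public LazyConstants(int size, IntToDoubleFunction compute) {
        this.slots = new AtomicReferenceArray<>(size);
        this.compute = compute;
    }

    /** Returns the constant at index i, computing and caching it on first use. */
    public double get(int i) {
        Double cached = slots.get(i); // volatile read of the slot
        if (cached == null) {
            cached = compute.applyAsDouble(i);
            // First writer wins; a losing thread simply returns its own
            // (identical, by assumption) result without overwriting.
            slots.compareAndSet(i, null, cached);
        }
        return cached;
    }
}
```

Two threads may still compute the same entry at the same time; the table only guarantees that the stored value is published safely, not that the work is done once.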

Do you have a tool to measure CPU temperature? The OS might be throttling the CPU to deal with temperature issues.

Jon Bright
  • That is interesting. Could it have something to do with the TurboBoost stuff in the new Nehalem chips? It is very strange that it slows down while still showing the same level of cpu usage in top. – javajustice Nov 02 '09 at 20:42
  • Though if it were throttling because of temperature, I'd expect it to sometimes slow down a process that is running fast, or vice versa. This never happens. The process is always stuck either fast or slow from the beginning, and only connecting visualvm and starting a cpu profile can knock it out of slow mode. – javajustice Nov 02 '09 at 21:28

Is it possible that your program is being paged to disk sometimes? In this case, you will need to look at the memory usage of the operating system as whole, rather than just your program. I know from experience there is a huge difference in runtime performance when memory is being continually paged to the disk and back.

I don't know much about OS X, but on Linux the "free" command is useful for this purpose.

Another possible cause of this slowdown is log files. I've known at least some logging code that slowed down the system incrementally as the log files grew. It's possible that your threads are synchronizing on a log file which is growing in size, and that a different log file is used when you restart your program.

erg0sum
  • I have tried to be conscious of this issue. My machine has 32G of memory and I limit the process to using 16G max heap, and the process generally never gets above 8G. So it shouldn't be paging, but next time I reproduce the bug I will find something to monitor the swap usage. – javajustice Nov 02 '09 at 20:28
  • I reproduced the issue and it is definitely not hitting the page file, at least according to the Mac OS Activity Monitor. – javajustice Nov 02 '09 at 21:24