
Hello fellow programmers. I have already asked one question, but despite the really good answers I got, I couldn't fix my problem. Since then, I have taken the time to refactor my code in a way that should improve its parallelization potential (fewer calculation batches, with more calculation duty each). But I still can't get better performance than serial processing.

I suspect this slow parallel processing is due to context switching. Or maybe it's due to "automatic" synchronization of common objects. I think you can help me understand what's going on.

Let me state my case: I'm writing a program for scientific calculations. It does not depend on anything external, just on the input values I give it at startup. The size of the problem can be measured by Ns (the name I use for it). It can be seen as the "resolution" of the solution; it is one of the user inputs and is usually of the order of 100.

Accordingly, I have several double arrays in my main class, such as double yS[Ns][N] or phiS[Ns][Nord][N], where N and Nord are other fixed magnitudes of the program. I have to calculate several things for each of the Ns points, and here comes the parallelization: each point's calculation is independent, so I can divide them among different threads and hope it gets faster.

So, instead of having one loop for (int i = 0; i < Ns; i++), I divided the calculation duty into Runnable batches, each covering a smaller interval: for (int i = start; i < end; i++), where start and end always lie between 0 and Ns. For example, on a dual-core PC I make two batches, one with start = 0 and end = Ns/2, the other with start = Ns/2 and end = Ns. On a quad core, the second batch would run from start = Ns/4 to end = Ns/2, and so on (assuming the division is exact in every case).

Each Batch, a class that implements Runnable, is stored in an ArrayList<Batch> and handed to a FixedThreadPool whose size equals the number of cores. The pool executes the batches, and I wait for them to finish using a simple CountDownLatch scheme.
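Roughly, the setup looks like this (a simplified sketch of my actual code; the per-point calculation itself is omitted):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class Batch implements Runnable {
        private final double[][] yS;      // shared data from the main class
        private final int start, end;     // this batch handles points [start, end)
        private final CountDownLatch done;

        Batch(double[][] yS, int start, int end, CountDownLatch done) {
            this.yS = yS;
            this.start = start;
            this.end = end;
            this.done = done;
        }

        @Override
        public void run() {
            try {
                for (int i = start; i < end; i++) {
                    // ... per-point calculation reading and writing yS[i] ...
                }
            } finally {
                done.countDown();         // signal completion even if something fails
            }
        }
    }

    class Driver {
        static void runAllBatches(double[][] yS, int Ns) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            CountDownLatch latch = new CountDownLatch(cores);
            List<Batch> batches = new ArrayList<>();
            int chunk = Ns / cores;       // assuming the division is exact
            for (int c = 0; c < cores; c++) {
                batches.add(new Batch(yS, c * chunk, (c + 1) * chunk, latch));
            }
            for (Batch b : batches) {
                pool.execute(b);
            }
            latch.await();                // wait for every batch to finish
            pool.shutdown();              // for this sketch; a long-lived pool would simply be reused
        }
    }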

Each of these batches needs to access the data in those arrays from the main class, but the access pattern is such that each batch only reads from yS[start][] to yS[end][], so two batches will never read the same array element. I wonder if Java still locks yS, even though no two batches access the same elements.

I also wonder whether my problem is related to context-switching overhead, since each batch deals with thousands of doubles, and whether the way the program is built affects it.

Maybe I should find a way to pass each batch just the elements of the arrays that are relevant to it, but I don't know how to approach this. If there were pointers, I could create new arrays over just the desired elements with simple pointer operations, without reallocating anything. Is there a way to do such a thing in Java?

Well, finally, just to mention: there is one part of the code that needs to be synchronized (it deals with other arrays), and it is already working fine. The calculation duties I've described above aren't the only thing my program does. They sit inside a loop, alternating with sequential processing parts, but they account for a significant share of the total execution time.

So, to summarize, the question is: why am I not gaining from multithreading when I expected to?

I've just run the plain serial and the multithreaded program a couple of times and got 14500 ms for the serial version and 15651 ms for the multithreaded one, both on the same dual core. Another point to notice: in the serial run, each calculation duty (from 0 to Ns) takes around 1.1 to 4.5 ms. With dual threading, each batch (Ns/2 points) takes around 0.5 to 3 ms (measured from the top to the bottom of the run() method; the time of each calculation duty differs due to its own numerical convergence).

Thanks very much for your attention.

ursoouindio

4 Answers


One possible problem you may be running into is threads thrashing over cache lines (false sharing). If different threads rapidly write to locations in the same cache line (for instance, nearby elements of the same array), then the hardware incurs a high communication overhead keeping the data consistent.
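For illustration (the names here are made up, not taken from your code):

    public class FalseSharingDemo {
        public static void main(String[] args) throws InterruptedException {
            final double[] sums = new double[2];

            // Likely false sharing: sums[0] and sums[1] sit on the same
            // 64-byte cache line, so the two writers keep invalidating that
            // line for each other even though they never touch the same element.
            Thread a = new Thread(() -> { for (int i = 0; i < 10_000_000; i++) sums[0] += i; });
            Thread b = new Thread(() -> { for (int i = 0; i < 10_000_000; i++) sums[1] += i; });
            a.start(); b.start();
            a.join(); b.join();

            // Common mitigation: accumulate into a local variable and publish once.
            Thread c = new Thread(() -> {
                double local = 0;
                for (int i = 0; i < 10_000_000; i++) local += i;
                sums[0] = local;   // one shared write instead of ten million
            });
            c.start(); c.join();
        }
    }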

Tom Hawtin - tackline
  • A correct answer, but I doubt the OP understands cache lines. – bestsss Feb 24 '11 at 14:41
  • maybe... but when one thread reads yS[0], the other reads yS[50]. Then, the first reads yS[1] and the other yS[51] and so on... – ursoouindio Feb 24 '11 at 14:46
  • @ursoouindio, wow, you have a multi-threaded application, not a serialized one. To put it simply: there is no 'then' part. The doubles yS[0] through yS[7] will be on the same cache line (x86_64 cache lines are 64 bytes). To answer your puzzle: split the array into equal parts, i.e. if you have int[2048], the first thread starts at index 0 and the second at index 1024. Never, ever do close indexing in a multi-threaded application. (I will look for a good article on false sharing.) – bestsss Feb 24 '11 at 14:53
  • Ouch! Sorry, @bestsss. You're right about the 'then' part, hehehe. Now I follow the cache thing, but I have no idea how to treat it. If you find a nice article about it, let me know! Thanks – ursoouindio Feb 24 '11 at 15:32
  • Here: two articles. They're not about Java, but the language (Java/C/C#) makes no difference. http://www.bluebytesoftware.com/blog/2009/10/20/FalseSharingIsNoFun.aspx (the author is a .NET architect) and the usual reference, Wikipedia: http://en.wikipedia.org/wiki/False_sharing (a rather weak article, though). – bestsss Feb 24 '11 at 16:31
  • I will take a look, although my tests following @Danish's suggestion seem to show that this is not the problem, since the results were the same when I made a local copy of every double inside each runnable. – ursoouindio Feb 24 '11 at 16:50
 I wonder if Java still locks yS, even though no two batches access the same elements.

There is no automatic synchronization or locking in Java; you have to code it explicitly.
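For example (a small sketch with made-up names):

    class Accumulator {
        private final Object lock = new Object();
        private double total = 0;

        // Reading a plain array element involves no lock at all.
        double read(double[][] yS, int i, int j) {
            return yS[i][j];
        }

        // A lock exists only where the code explicitly asks for one.
        void add(double partialResult) {
            synchronized (lock) {
                total += partialResult;   // explicit critical section
            }
        }
    }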

 I also wonder whether my problem is related to context-switching overhead...

Context switches do have overhead. If all your threads work on the same CPU-intensive task, then the number of threads should equal the number of available processor cores.

 If there were pointers, I could create new arrays over just the desired elements with simple pointer operations, without reallocating anything.

Object references in Java are passed by value (for example, when you pass an object to a method, the reference is copied, not the object). References are essentially pointers, with the difference that you cannot do pointer arithmetic on them or dereference them yourself. So no objects are copied in Java, except when your code explicitly requests it.
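A small demo of what that means for your arrays (names assumed from your question):

    public class ReferenceDemo {
        // 'a' receives a copy of the reference, not a copy of the array:
        // writes through it are visible to the caller.
        static void process(double[][] a, int start, int end) {
            for (int i = start; i < end; i++) {
                a[i][0] += 1.0;
            }
        }

        public static void main(String[] args) {
            double[][] yS = new double[100][10];
            process(yS, 0, 50);           // no array data is copied by this call
            System.out.println(yS[0][0]); // prints 1.0
        }
    }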

That being said, you should be aware of another thing: if you add a lot of elements to collections (List, HashMap, etc.), those collections need to grow. Internally, all these collections use arrays to store their elements, and when elements are added the backing array must be resized. Since an array cannot be resized in Java, a new array is created and all references to the old objects are copied over (or, for primitive types, all the data is copied). So when creating collections, size them appropriately up front so they never need to be resized.
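For instance (a sketch; the capacities are placeholders for whatever counts your program knows in advance):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class PreSized {
        // Pre-sizing avoids the repeated grow-and-copy cycle as elements are added.
        List<Double> results = new ArrayList<>(10_000);    // expected element count
        Map<String, double[]> cache = new HashMap<>(256);  // initial capacity hint
    }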

You may also like to read How many threads should I use in my Java program?

Peter Knego
  • Pass by reference: http://stackoverflow.com/questions/40480/is-java-pass-by-reference – Nathan Feger Feb 24 '11 at 14:37
  • thanks, @Peter, but it is clear to me that the optimal number of threads is the number of processor cores, and I'm already using it that way. And thanks for the clarification about automatic locking. – ursoouindio Feb 24 '11 at 14:40
  • When I mentioned pointers, I was thinking that the context-switching overhead might be smaller if I passed arrays that start at `yS[start][]` and end at `yS[end][]`, instead of the whole yS[][]... Does that make sense, or does it make no difference? – ursoouindio Feb 24 '11 at 14:44
  • It makes no difference, as you are only passing a reference to the array, not the array itself. Only a new reference is created, not a new array. – Peter Knego Feb 24 '11 at 15:01
  • I see. Thanks for the information. I thought that maybe all the runnables reaching into the same array was the reason for the lack of performance improvement. – ursoouindio Feb 24 '11 at 15:17

Based on what you've mentioned so far, I would try the following things:

  1. Compare results between the serial and the parallel version for increasing array sizes. The difference in performance may be insignificant at your current problem size and may only show itself once the arrays get bigger.

  2. Give each runnable its own copy of the array (see the sketch below). Performance-wise, the way an array is laid out in memory and how you access it matters. Even though you have a 2D array, it is really stored as an array of separate row arrays, so sharing it between runnables may be inefficient for some of them.
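A sketch of the second suggestion, assuming your yS array and the [start, end) ranges from the question:

    import java.util.Arrays;

    class Slicer {
        // Build a private, deep copy of rows [start, end) for one runnable.
        static double[][] privateSlice(double[][] yS, int start, int end) {
            double[][] slice = Arrays.copyOfRange(yS, start, end); // copies row references
            for (int i = 0; i < slice.length; i++) {
                slice[i] = slice[i].clone();                       // deep-copy each row's data
            }
            return slice;
        }
    }

Results would then have to be copied back into yS[start..end) after the batch finishes, which is part of the copying cost mentioned above.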

Danish
  • The first I've tested. So far, it only starts to get faster at much higher Ns values than I need, but I believe it should show improvements even for smaller Ns... – ursoouindio Feb 24 '11 at 14:53
  • The second: do you mean to pass a reference to the arrays (the usual thing) and store it inside the runnable (like this.yS = ySpassedFromArgument)? Or to make a local copy into local variables of the runnable each time the calculation duty comes up? – ursoouindio Feb 24 '11 at 14:56
  • Yes, I meant a local independent copy. The copying will hurt performance, but that may be offset if the smaller copy is laid out in memory more efficiently for each runnable. – Danish Feb 24 '11 at 15:01
  • Searching around (http://www.ibm.com/developerworks/java/library/j-5things4.html?ca=dat-), I found CopyOnWriteArrayList, but I haven't figured out exactly how it works. Is that the case here? – ursoouindio Feb 24 '11 at 15:03
  • Worth a try. Thanks, @Danish. – ursoouindio Feb 24 '11 at 15:05
  • It all depends on what you're doing with your arrays. From your description, it looks like you're mostly reading from them, so I'm not sure CopyOnWriteArrayList will help much here. CopyOnWriteArrayList makes reading and writing thread-safe by copying the backing array every time it is modified. – Danish Feb 24 '11 at 15:14
  • I mostly read, but I do write to them, though not as often as I read. – ursoouindio Feb 24 '11 at 15:28
  • I tried your suggestion and made a local copy of just the needed data inside each runnable. The results came out even for serial and dual threading (the same as before). I'm thinking of testing this on a quad core, but I can't do it right now. – ursoouindio Feb 24 '11 at 16:40
  • Having a local copy of the values means, each time I start the multithreading, allocating space for 25610 new doubles (with the current specs of my program) and actually copying 5130 of them. The rest start from zero anyway and get updated during the calculation duty. Any idea of the amount of overhead this kind of work adds? – ursoouindio Feb 24 '11 at 16:48
  • Just tested with double the Nc used and, with your suggestion, both programs evened out at ~53000 ms. The big winner is the serial program without local variables, at ~49000 ms; its dual-threaded brother got the same ~53000 ms. – ursoouindio Feb 24 '11 at 17:07

Do you have sufficient memory available to create multiple collections and pass a unique collection of work to each thread? That way you can definitively rule out contention from multiple threads accessing the same memory.

Nathan Feger
  • Yes, I think the memory will be enough. But as this program is just a part to be used in larger applications, I would like to keep its memory usage small... Still, worth trying. How do you suggest approaching this? – ursoouindio Feb 24 '11 at 14:52
  • This would make a processor-cache issue much worse. The whole trick in high-performance computing is to prevent cache flushing and reloading. To minimize that, you want to minimize the set of data you operate on and avoid having duplicate copies lying around. – Peter Knego Feb 24 '11 at 15:07
  • @Peter, actually I did something similar (following the suggestion by Danish) and the problem didn't get much worse, but the performance didn't change either. – ursoouindio Feb 24 '11 at 20:58
  • @Peter, I'll buy that; having multiple pointers into the same section of memory certainly seems more likely to keep it close to the chip, since more things are requesting it. – Nathan Feger Feb 25 '11 at 02:54