
I am implementing the Floyd–Warshall algorithm in C++, C#, and Java. In each language I have a sequential and a parallel implementation. After testing, the results were as follows.
(Elapsed time is measured for the main function only; reading files, initializing variables, etc. are not included.)
Download the sources here: SourceCodes

c++

  • IDE: NetBeans
  • Compiler: MinGW-4.8.1
  • Sequential time: 9.333000
  • Parallel time: 2.539000
  • Using OpenMP for the parallel version

If NumOfThreads = 1 it runs sequentially; otherwise it runs in parallel.

variables

    #define n 1000 /* The number of nodes */
    double dist[n][n];

    void floyd_warshall(int NumOfThreads) {
        int i, j, k;
        omp_set_num_threads(NumOfThreads);
        for (k = 0; k < n; ++k)
            #pragma omp parallel for private(i, j)
            for (i = 0; i < n; ++i)
                for (j = 0; j < n; ++j)
                    if ((dist[i][k] * dist[k][j] != 0) && (i != j))
                        if ((dist[i][k] + dist[k][j] < dist[i][j]) || (dist[i][j] == 0))
                            dist[i][j] = dist[i][k] + dist[k][j];
    }

java

  • IDE: NetBeans
  • Compiler: NetBeans default
  • Sequential time: 11.632
  • Parallel time: 3.089
  • JVM options: -Xms512m -Xmx1024m
  • import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;

java variables

    public final int numNodes = 1000;
    public final double[][] costs = new double[numNodes][numNodes];

I did not include the Java code here because it works correctly (I think).
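Since the Java code isn't included, here is a minimal sketch of what a parallel version built on the `ExecutorService` imports listed above might look like. The class name, the row-partitioning scheme, and the tiny test graph are my assumptions for illustration, not the original code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FloydWarshallParallel {
    static final int numNodes = 4;                 // tiny size for illustration
    static double[][] costs = new double[numNodes][numNodes];

    // For each k, split the rows i across numThreads tasks and wait for
    // all of them before moving on to k+1. Row k itself is never written
    // during iteration k (its update path is blocked by costs[k][k] == 0),
    // so the tasks don't race on the data they read.
    static void floydWarshall(int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        int chunk = (numNodes + numThreads - 1) / numThreads;
        for (int k = 0; k < numNodes; ++k) {
            final int kk = k;
            List<Future<?>> futures = new ArrayList<>();
            for (int t = 0; t < numThreads; ++t) {
                final int lo = t * chunk;
                final int hi = Math.min(numNodes, lo + chunk);
                futures.add(pool.submit(() -> {
                    for (int i = lo; i < hi; ++i)
                        for (int j = 0; j < numNodes; ++j)
                            if ((costs[i][kk] * costs[kk][j] != 0) && (i != j))
                                if ((costs[i][kk] + costs[kk][j] < costs[i][j])
                                        || (costs[i][j] == 0))
                                    costs[i][j] = costs[i][kk] + costs[kk][j];
                }));
            }
            for (Future<?> f : futures) f.get();   // barrier between k iterations
        }
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // 0 means "no edge", matching the C++ and C# representations.
        costs[0][1] = 5; costs[1][2] = 3; costs[2][3] = 1; costs[0][3] = 20;
        floydWarshall(2);
        System.out.println(costs[0][3]); // shortest path 0->1->2->3 = 9.0
    }
}
```

The structure mirrors the OpenMP version: the parallel region is per-`k`, with a barrier (`f.get()` on every future) before the next `k` begins.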

c#

  • IDE: Visual Studio 2012
  • Compiler: Visual Studio 2012 default
  • Sequential time: 31.312
  • Parallel time: 8.920
  • using System.Threading.Tasks;

variables

    const int n = 1000;
    static double[,] dist = new double[n, n];

parallel code

    static void floyd_warshall(ParallelOptions pOp)
    {
        int k;
        for (k = 0; k < n; ++k)
            Parallel.For<int>(0, n, pOp, () => 0, (i, loop, j) =>
            {
                for (j = 0; j < n; ++j)
                    if ((dist[i, k] * dist[k, j] != 0) && (i != j))
                        if ((dist[i, k] + dist[k, j] < dist[i, j]) || (dist[i, j] == 0))
                            dist[i, j] = dist[i, k] + dist[k, j];
                return 0;
            }, (j) => { });
    }

single code

    static void floyd_warshallSingle()
    {
        int i, j, k;
        for (k = 0; k < n; ++k)
            for (i = 0; i < n; ++i)
                for (j = 0; j < n; ++j)
                    if ((dist[i, k] * dist[k, j] != 0) && (i != j))
                        if ((dist[i, k] + dist[k, j] < dist[i, j]) || (dist[i, j] == 0))
                            dist[i, j] = dist[i, k] + dist[k, j];
    }

What's wrong with my C# implementation?
All three programs use the same input file.
My question is: why does C# take so much longer to run this algorithm? C++ and Java have nearly the same elapsed time, so I suspect my C# implementation is wrong, because such a large gap between C# and C++ is strange!
Please help me improve my C# implementation, or name some possible reasons. Thank you!

edit 1


I changed my arrays to jagged arrays and the results changed to:

  • Sequential time: 19.22
  • Parallel time: 4.903

There is still a huge difference between C# and C++/Java! Any idea why?
new variables

    const int n = 1000;
    static double[][] dist = new double[n][];

new code:

    static void floyd_warshallSingle()
    {
        int i, j, k;
        for (k = 0; k < n; ++k)
            for (i = 0; i < n; ++i)
                for (j = 0; j < n; ++j)
                    if ((dist[i][k] * dist[k][j] != 0) && (i != j))
                        if ((dist[i][k] + dist[k][j] < dist[i][j]) || (dist[i][j] == 0))
                            dist[i][j] = dist[i][k] + dist[k][j];
    }



    static void floyd_warshall(ParallelOptions pOp)
    {
        int k;
        for (k = 0; k < n; ++k)
            Parallel.For<int>(0, n, pOp, () => 0, (i, loop, j) =>
            {
                for (j = 0; j < n; ++j)
                    if ((dist[i][k] * dist[k][j] != 0) && (i != j))
                        if ((dist[i][k] + dist[k][j] < dist[i][j]) || (dist[i][j] == 0))
                            dist[i][j] = dist[i][k] + dist[k][j];
                return 0;
            }, (j) => { });
    }
    You're using jagged arrays in C++ and Java, but two-dimensional arrays in C#. The latter are known to be slower in some scenarios; try converting your C# implementation to jagged arrays too. – Douglas Nov 22 '14 at 16:34
  • @Douglas I don't think so. His array in C++ certainly isn't jagged. And in Java all 2D arrays are an array of pointers to each row under the hood (which I think is what you mean by jagged, since none of the code he shows has anything that is really jagged). Finally, on most modern machines, a flat layout, like he has in C++, will be considerably faster than an array of pointers to rows, because of significantly better locality. – James Kanze Nov 22 '14 at 17:00
  • I added the variables in Java too! –  Nov 22 '14 at 17:06
  • @Douglas C++ arrays are not jagged! Please explain more. –  Nov 22 '14 at 17:09
  • @JamesKanze: The term "[jagged arrays](http://msdn.microsoft.com/en-us/library/2s05feca.aspx)" is used loosely to refer to arrays of arrays. There are a lot of articles confirming that jagged arrays perform faster than two-dimensional ones in C# due to optimizations by the compiler and the CLR; see [Skeet's post](http://stackoverflow.com/a/468873). – Douglas Nov 22 '14 at 17:10
  • To clear up the confusion: There is `int[][]` (an array of `int[]`) and `int[,]` (a flat memory allocation) in C# - the former being generally referred to as jagged. The fact that `int[][]` is/was apparently faster in C# than `int[,]` just means that somewhere the compiler/CLR is just doing a horrible job. In the case of linear accesses both should be about equally fast, but if you do random accesses on a jagged array you get a second cache miss that could be avoided. But there's never a reason for a jagged array to be faster than a flat array. – Voo Nov 22 '14 at 17:44
  • @Douglas jagged arrays are slow too! Please see my edit 1. –  Nov 22 '14 at 17:47
  • @Voo both are slower than C++ and Java! Look at my edit 1 :) –  Nov 22 '14 at 17:48
  • @Voo: The array layout has little effect on cache locality. Caches operate at the granularity of cache lines (typically 64 bytes), not objects. Unless your rows are small enough to fit into a cache line (i.e. have 16 ints or less), it won't make much of a difference. Jagged arrays do require additional cache memory to hold the array pointers, but this becomes insignificant when the rows are large. – Douglas Nov 22 '14 at 17:57
  • Also, there is a reason why two-dimensional arrays may be slower: they require multiplication to compute the address of the array elements. It could be the case that the CPU architecture is less efficient at performing multiplication than resolving an additional pointer reference. – Douglas Nov 22 '14 at 18:00
  • You don't say how you're running the tests. For C#, if you're running with the debugger attached (i.e. F5 in Visual Studio), then your timings are going to be off. Run without the debugger attached (Ctrl+F5, or "Run without debugging"). You should notice a big difference. – Jim Mischel Nov 22 '14 at 18:03
  • Also, concerning jagged and rectangular arrays in C#: http://blog.mischel.com/2013/05/08/are-jagged-arrays-faster-than-rectangular-arrays/ – Jim Mischel Nov 22 '14 at 18:05
  • @JimMischel yeah, I use Ctrl+F5 :-) and I also use `Stopwatch` –  Nov 22 '14 at 18:06
  • Another problem could be order of operations. C# has a strict left-to-right rule. So with `if ((dist[i,k] * dist[k,j] != 0) && (i != j))`, the multiplication is *always* done. The C++ compiler (I don't know about Java) could potentially generate code to do the `i != j` test first, which would be much faster. The same is true of the following `if` statement: put the `(dist[i,j] == 0)` test first. – Jim Mischel Nov 22 '14 at 18:13
  • @Douglas Huh? If you have a flat array, to access the `(i,j)` element is `ptr[i * RowSize + j]` (simplified obviously). To do the same for a ragged array we have `tmp = ptr[i]; tmp[j]` - quite clearly there's two memory accesses there - worse of all the later even has a dependency on the first. Hence if you access a large array randomly (or do enough work on each accessed element to flush the cacheline) you do get a second cache miss. – Voo Nov 22 '14 at 18:21
  • @JimMischel tested it, and no difference! Maybe we should admit that C++ and Java are faster :( –  Nov 22 '14 at 18:24
  • I added a download link for the source codes. –  Nov 22 '14 at 18:59
  • Why are you testing != 0 with a multiply? Wouldn't checking that both sides are != 0 be equivalent and save a multiply? – Rup Nov 22 '14 at 19:08
  • @Rup thanks, that is a good point, but the C++ and C# implementations are the same! So why does C# consume more time? –  Nov 22 '14 at 19:11
  • One problem you run into (as I pointed out in my blog post) is array bounds checking. In C#, every index is checked against the bounds of the array. That's going to make a difference in running time. Still, I wouldn't expect it to be double. – Jim Mischel Nov 22 '14 at 19:21
  • @JimMischel I read your blog post, thanks for the nice post, and as you said, we wouldn't expect it to be double! –  Nov 22 '14 at 19:24
  • @Voo: Your first two points are correct. However, for the jagged array, it's likely that the entire outer array of pointers will fit (and remain) in L1 cache throughout the execution, whilst the inner arrays are more-or-less as likely as the `ptr[i * RowSize + j]` accesses of the flat array to incur cache misses. Although the jagged array requires two memory accesses for each element, these won't necessarily be slower than the flat array if both addresses are in the L1 cache. – Douglas Nov 23 '14 at 00:23
  • @Douglas Depends how much work you do, but even in the best case you get one additional cache miss every 8 accesses the first time you iterate through it. But as soon as you do some work on the items and access the array randomly you will get quite a few more cache misses. As long as the array stays in the l1 cache the difference will be negligible indeed. In any case the flat array should never be less efficient than the jagged if the compiler is equally clever on both. – Voo Nov 23 '14 at 00:29
  • @Voo: Yes, you're right; jagged arrays would get a few more misses due to the additional space required for the pointers. I remember reading somewhere that flat arrays *can* be less efficient if the CPU is slow at multiplication; I'll try to find the link later and get back to you. – Douglas Nov 23 '14 at 00:39
  • @Douglas You do exchange one integer multiplication for one additional read. On Intel that's a great exchange (a l1 cache read is still about five cycles while you can do about three multiplications per cycle) on some embedded systems this may look rather different. I don't work on anything smaller than ARM CPUs so I can't speak about that. – Voo Nov 23 '14 at 00:52
  • @Voo: I'm not convinced about your cycle counts. On Core i7, L1 cache read takes 4 cycles; multiplication takes 3 cycles. Instruction throughput can be improved significantly with pipelining and superscalar execution – I assume that's how you arrived at three multiplications per cycle – but that applies equally to memory access. – Douglas Nov 23 '14 at 13:55
  • “Use jagged arrays instead of multidimensional arrays to benefit from MSIL performance optimizations. MSIL has specific instructions that target single dimensional zero-based arrays (SZArrays) and access to this type of array is optimized. In contrast, multidimensional arrays are accessed using the same generic code for all types, which results in boxing and unboxing for arrays of primitive types.” ([Improving Managed Code Performance](http://msdn.microsoft.com/en-us/library/ff647790.aspx#scalenetchapt05_topic27)) – Douglas Nov 23 '14 at 14:04
  • @Douglas True about the latency for multiplication, got that confused (although iirc you can have more arithmetic units than IO units, so that influences bandwidth still). And yes that the CLR/JIT produce severely suboptimal code for flat arrays is known - otherwise we wouldn't have that discussion. – Voo Nov 24 '14 at 18:53
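The micro-optimizations suggested in the comments (test `i != j` first so short-circuit evaluation skips the arithmetic, replace the multiply with two `!= 0` tests, and check `dist[i][j] == 0` before the addition) can be sketched as follows. This is an illustrative Java version I wrote, not the questioner's code; the same reordering applies verbatim to the C# loops:

```java
public class InnerLoopReordered {
    static final int n = 4;                        // tiny size for illustration
    static double[][] dist = new double[n][n];

    // Same relaxation as the question's code, but with the cheap
    // comparisons first (short-circuiting skips the arithmetic on
    // most iterations) and the multiply replaced by two != 0 tests.
    static void floydWarshall() {
        for (int k = 0; k < n; ++k)
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j)
                    if (i != j && dist[i][k] != 0 && dist[k][j] != 0)
                        if (dist[i][j] == 0 || dist[i][k] + dist[k][j] < dist[i][j])
                            dist[i][j] = dist[i][k] + dist[k][j];
    }

    public static void main(String[] args) {
        dist[0][1] = 5; dist[1][2] = 3; dist[0][2] = 10;  // 0 means "no edge"
        floydWarshall();
        System.out.println(dist[0][2]); // path 0->1->2 beats the direct edge: 8.0
    }
}
```

Because `0` encodes "no edge" in this representation, the reordered conditions compute exactly the same result as the originals; only the evaluation order changes.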

1 Answer


One way to determine if it's array bounds checking, or at least determine if array bounds checking is part of the problem, would be to remove some of the index calculations. For example:

    static void floyd_warshallSingle()
    {
        int i, j, k;
        for (k = 0; k < n; ++k)
        {
            var distk = dist[k];
            for (i = 0; i < n; ++i)
            {
                var disti = dist[i];
                for (j = 0; j < n; ++j)
                    if ((i != j) && (disti[k] * distk[j] != 0))
                        if ((disti[j] == 0) || disti[k] + distk[j] < disti[j])
                            disti[j] = disti[k] + distk[j];
            }
        }
    }

Here all I've done is use distk as a reference to dist[k]. I suspect the compiler is already doing that optimization, which is likely how you realized the performance gain you got when you went from a rectangular array to a jagged array. But it's worth checking out.

Also, you said that you're running without the debugger attached. I assume you're also running a release build? And are all three programs (C++, Java, and C#) running the same bitness? That is, are they all 64 bit executables? All 32 bit executables? Be careful with C#, because the "Prefer 32-bit" flag could be turned on in the project options. That could cause you to run in 32-bit mode when you're compiling with "Any CPU", even on a 64-bit system.

Jim Mischel
  • Thanks for the good answer. _I assume you're also running a release build?_ That was my problem: after running the release build, the times changed to 10.108 and 2.976. –  Nov 23 '14 at 10:18