
I am trying to replicate the linear programming solver described in this report:

http://www.idi.ntnu.no/~elster/master-studs/spampinato/spampinato-linear-prog-gpu-report.pdf.

The device I am using is a Quadro FX 1800M with compute capability 1.2.

My problem is that when I launch more than 22 threads per block, I get inaccurate results most of the time (sometimes all zeros). Occasionally, however, I get accurate results even when I launch 512 threads per block.

Here are some test runs that I made. ("Sequential" refers to a CPU-based version used for comparison.)

Iteration No 1 : of Sequential Version
Optimum Found 24.915583
Elapsed time: 0.001049725

Iteration No 1: of Parallel Version
BS-(Number of Threads) = : 20
Optimum found: 24.915583

Iteration No 2: of Parallel Version
BS-(Number of Threads) = : 256
Optimum found: 24.915607

Iteration No 3: of Parallel Version
BS-(Number of Threads) = : 512
Optimum found: 24.917068

Iteration No 4: of Parallel Version
BS-(Number of Threads) = : 2
Optimum found: 24.915583

Iteration No 5: of Parallel Version
BS-(Number of Threads) = : 456 
Optimum found: -30693000299230806209574138333792043008.000000

Iteration No 6: of Parallel Version
BS-(Number of Threads) = : 456
Problem unsolvable: either qth==0 or loop too long.

Iteration No 7: of Parallel Version
BS-(Number of Threads) = : 512
Optimum found: 25.010513

Iteration No 8: of Parallel Version
BS-(Number of Threads) = : 256
Problem unsolvable: either qth==0 or loop too long.

Iteration No 9: of Parallel Version
BS-(Number of Threads) = : 256
Optimum found: 0.000000

Iteration No 10: of Parallel Version
BS-(Number of Threads) = : 512
Optimum found: 0.000000

Can somebody kindly point out what I might be doing wrong? I know that I haven't posted the code, but I am assuming that the code is correct, since I am copying it from the research paper, and that the problem is on my end.

I should also point out that I get the following warning when compiling the CUDA code:

ptxas /tmp/tmpxft_000017e7_00000000-10_culiblp.ptx, line 263; warning : Double is not supported. Demoting to float

Could this be a reason for the incorrect results?

nobody
Ahmad

1 Answer


My problem is that when I launch more than 22 threads per block then most of the time I get inaccurate results(sometimes all zeros),

Can somebody kindly point what I might be doing wrong,

I wasn't able to build the code because the header files seem to be missing from the paper. I could try to reconstruct them, but one thing I noticed is that the variable (or constant) BS doesn't seem to be defined anywhere. So I'm guessing it was originally defined in culiblp.h (which is not provided).

Looking at culiblp.cu in the paper, I notice some kernel launches like this:

init_AInD<<<dim3(kn, km1), dim3(BS, BS)>>>(devA, devD, m, n);
                           ^^^^^^^^^^^^

This creates a 2D threadblock of BS*BS threads. If you set BS to a value greater than 22, the product exceeds 512 threads (23*23 = 529), which is the maximum number of threads per block for your cc 1.x GPU. In that case the kernel launch will fail.

I believe this is at least a contributing factor to the failures you see when BS is larger than 22.
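As a rough illustration of the arithmetic above, here is a minimal sketch that computes the thread count of a square block and compares it with the device limit. The helper names (threadsInSquareBlock, largestSquareSide, maxThreadsPerBlock) are my own and are not from the paper; the sketch also assumes you are interested in device 0.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of threads in a square block such as dim3(BS, BS).
int threadsInSquareBlock(int bs) {
    return bs * bs;
}

// Hypothetical helper: largest block side whose square still fits in maxThreads.
int largestSquareSide(int maxThreads) {
    int bs = 0;
    while ((bs + 1) * (bs + 1) <= maxThreads) {
        ++bs;
    }
    return bs;
}

// Hypothetical helper: per-block thread limit of device 0, or -1 if the query fails.
int maxThreadsPerBlock() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return -1;
    }
    return prop.maxThreadsPerBlock;
}
```

With a limit of 512, largestSquareSide(512) gives 22, since 22*22 = 484 fits and 23*23 = 529 does not. Comparing threadsInSquareBlock(BS) against maxThreadsPerBlock() before launching would catch an oversized BS early.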

You could confirm this by running your code with cuda-memcheck. Also, if you plan to work with this code, I'd suggest adding proper CUDA error checking to every API call and kernel launch.
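For reference, error checking around a kernel launch might look something like the sketch below. The names cudaOk, emptyKernel and launchSquareBlock are made up for illustration, and emptyKernel is only a placeholder standing in for a real kernel such as init_AInD.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a message if a CUDA call failed; return true on success.
static bool cudaOk(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel (assumption: stands in for a real kernel); it does no work.
__global__ void emptyKernel() {}

// Launch one bs x bs block and report whether both launch and execution succeeded.
bool launchSquareBlock(int bs) {
    emptyKernel<<<dim3(1, 1), dim3(bs, bs)>>>();
    // Launch-configuration problems (e.g. too many threads per block) show up here.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) {
        return false;
    }
    // Errors raised while the kernel was running show up after synchronizing.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

On a device limited to 512 threads per block, I would expect launchSquareBlock(23) to return false and print an invalid-configuration message, whereas without the check the failed launch goes unnoticed.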

The occasional apparent successes with BS above 22 could possibly be explained if you had done a successful run (say, with BS at 22 or less) immediately beforehand. Even when a kernel fails to launch, intermediate data left in device memory by the previous successful run may make the results appear correct.

Community
Robert Crovella