
I have encountered this strange issue while debugging.

In my code, I can initialize a host array srcArr_h[totArrElm] in two ways:

1)

   for(int ic=0; ic<totArrElm; ic++)
    {
     srcArr_h[ic] = (float)(rand() % 256);
    }

or

2) (about half of the array elements will be set to zero)

   for(int ic=0; ic<totArrElm; ic++)
    {
     int randV = (rand() % 256);
     srcArr_h[ic] = randV%2;        
    }

If I use these arrays as input to a kernel function, I get drastically different timings. In particular, if totArrElm = ARRDIM*ARRDIM with ARRDIM = 8192, I get

Timing 1) 64599.3 ms

Timing 2) 9764.1 ms

What's the trick? Of course I did verify that the src host initialization is not contributing to the big time difference I get. It sounds very strange to me, but could it be due to optimization at runtime?

Here is my code:

#include <string>
#include <stdint.h>
#include <iostream>
#include <stdio.h>
using namespace std;

#define ARRDIM 8192

__global__ void gpuKernel
(
    float *sa, float *aux,
    size_t memPitchAux, int w,
    float *c_glob
)
{
    float c_loc[256];
    float sc_loc[256];

    float g0=0.0f;

    int tidx = blockIdx.x * blockDim.x + threadIdx.x; // x-coordinate of pixel = column in device memory
    int tidy = blockIdx.y * blockDim.y + threadIdx.y; // y-coordinate of pixel = row in device memory
    int idx  = tidy * memPitchAux/4 + tidx;

    for(int ic=0; ic<256; ic++)
    {
        c_loc[ic] = 0.0f;
    }

    for(int ic=0; ic<255; ic++)
    {
        sc_loc[ic] = 0.0f;
    }

    for(int is=0; is<255; is++)
    {
        int ic = fabs(sa[tidy*w +tidx]);
        c_loc[ic] +=  1.0f;
    }

    for(int ic=0; ic<255; ic++)
    {
        g0 += c_loc[ic];
    }
    aux[idx] = g0;
}

int main(int argc, char* argv[])
{

    float time, loop_time;
    cudaEvent_t start, stop;
    cudaEvent_t start_loop, stop_loop;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0) ;
    /*
     * array src host and device
     */
    int heightSrc = ARRDIM;
    int widthSrc = ARRDIM;
    cudaSetDevice(0);

    float *srcArr_h, *srcArr_d;
    size_t nBytesSrcArr = sizeof(float)*heightSrc * widthSrc;

    srcArr_h = (float *)malloc(nBytesSrcArr); // Allocate array on host
    cudaMalloc((void **) &srcArr_d, nBytesSrcArr); // Allocate array on device
    cudaMemset((void*)srcArr_d,0,nBytesSrcArr); // set to zero

    int totArrElm = heightSrc*widthSrc;

    cudaEventCreate(&start_loop);
    cudaEventCreate(&stop_loop);
    cudaEventRecord(start_loop, 0) ;

    for(int ic=0; ic<totArrElm; ic++)
    {
       srcArr_h[ic] = (float)(rand() % 256); // case 1)
//     int randV = (rand() % 256); // case 2)
//     srcArr_h[ic] = randV%2;    
    }

    cudaEventRecord(stop_loop, 0);
    cudaEventSynchronize(stop_loop);
    cudaEventElapsedTime(&loop_time, start_loop, stop_loop);
    printf("Timimg LOOP: %3.1f  ms\n", loop_time);

    cudaMemcpy( srcArr_d, srcArr_h,nBytesSrcArr,cudaMemcpyHostToDevice);

    /*
     * auxiliary buffer auxD to save final results
     */
    float *auxD;
    size_t auxDPitch;
    cudaMallocPitch((void**)&auxD,&auxDPitch,widthSrc*sizeof(float),heightSrc);
    cudaMemset2D(auxD, auxDPitch, 0, widthSrc*sizeof(float), heightSrc);

    /*
     * auxiliary buffer auxH allocation + initialization on host
     */
    size_t auxHPitch;
    auxHPitch = widthSrc*sizeof(float);
    float *auxH = (float *) malloc(heightSrc*auxHPitch);

    /*
     * kernel launch specs
     */
    int thpb_x = 16;
    int thpb_y = 16;

    int blpg_x = (int) widthSrc/thpb_x + 1;
    int blpg_y = (int) heightSrc/thpb_y +1;
    int num_threads = blpg_x * thpb_x + blpg_y * thpb_y;

    /* c_glob array */
    int cglob_w = 256;
    int cglob_h = num_threads;

    float *c_glob_d;
    size_t c_globDPitch;
    cudaMallocPitch((void**)&c_glob_d,&c_globDPitch,cglob_w*sizeof(float),cglob_h);
    cudaMemset2D(c_glob_d, c_globDPitch, 0, cglob_w*sizeof(float), cglob_h);

    /*
    * kernel launch
    */
    dim3 dimBlock(thpb_x,thpb_y, 1);
    dim3 dimGrid(blpg_x,blpg_y,1);

    gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d,auxD, auxDPitch, widthSrc, c_glob_d);

    cudaThreadSynchronize();

    cudaMemcpy2D(auxH,auxHPitch,  // to CPU (host)
                 auxD,auxDPitch,  // from GPU (device)
                 auxHPitch, heightSrc, // size of data (image)
                 cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Timimg: %3.1f  ms\n", time);

    cudaFree(srcArr_d);
    cudaFree(auxD);
    cudaFree(c_glob_d);

}

My Makefile:

# OS Name (Linux or Darwin)
OSUPPER = $(shell uname -s 2>/dev/null | tr [:lower:] [:upper:])
OSLOWER = $(shell uname -s 2>/dev/null | tr [:upper:] [:lower:])

# Flags to detect 32-bit or 64-bit OS platform
OS_SIZE = $(shell uname -m | sed -e "s/i.86/32/" -e "s/x86_64/64/")
OS_ARCH = $(shell uname -m | sed -e "s/i386/i686/")

# These flags will override any settings
ifeq ($(i386),1)
    OS_SIZE = 32
    OS_ARCH = i686
endif

ifeq ($(x86_64),1)
    OS_SIZE = 64
    OS_ARCH = x86_64
endif

# Flags to detect either a Linux system (linux) or Mac OSX (darwin)
DARWIN = $(strip $(findstring DARWIN, $(OSUPPER)))

# Location of the CUDA Toolkit binaries and libraries
CUDA_PATH       ?= /usr/local/cuda-5.0
CUDA_INC_PATH   ?= $(CUDA_PATH)/include
CUDA_BIN_PATH   ?= $(CUDA_PATH)/bin
ifneq ($(DARWIN),)
  CUDA_LIB_PATH  ?= $(CUDA_PATH)/lib
else
  ifeq ($(OS_SIZE),32)
    CUDA_LIB_PATH  ?= $(CUDA_PATH)/lib
  else
    CUDA_LIB_PATH  ?= $(CUDA_PATH)/lib64
  endif
endif

# Common binaries
NVCC            ?= $(CUDA_BIN_PATH)/nvcc
GCC             ?= g++

# Extra user flags
EXTRA_NVCCFLAGS ?=
EXTRA_LDFLAGS   ?=
EXTRA_CCFLAGS   ?=

# CUDA code generation flags
# GENCODE_SM10    := -gencode arch=compute_10,code=sm_10
# GENCODE_SM20    := -gencode arch=compute_20,code=sm_20
# GENCODE_SM30    := -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35
GENCODE_SM10    := -gencode arch=compute_10,code=sm_10
GENCODE_SM20    := -gencode arch=compute_20,code=sm_20
GENCODE_SM30    := -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35
#GENCODE_FLAGS   := $(GENCODE_SM20) $(GENCODE_SM10)

GENCODE_FLAGS   := $(GENCODE_SM10) $(GENCODE_SM20) $(GENCODE_SM30)

# OS-specific build flags
ifneq ($(DARWIN),) 
      LDFLAGS   := -Xlinker -rpath $(CUDA_LIB_PATH) -L$(CUDA_LIB_PATH) -lcudart
      CCFLAGS   := -arch $(OS_ARCH) 
else
  ifeq ($(OS_SIZE),32)
      LDFLAGS   := -L$(CUDA_LIB_PATH) -lcudart
      CCFLAGS   := -m32
  else
      LDFLAGS   := -L$(CUDA_LIB_PATH) -lcudart
      CCFLAGS   := -m64
  endif
endif

# OS-architecture specific flags
ifeq ($(OS_SIZE),32)
      NVCCFLAGS := -m32
else
      NVCCFLAGS := -m64
endif

# OpenGL specific libraries 
ifneq ($(DARWIN),)
    # Mac OSX specific libraries and paths to include
    LIBPATH_OPENGL  := -L../../common/lib/darwin -L/System/Library/Frameworks/OpenGL.framework/Libraries -framework GLUT -lGL -lGLU ../../common/lib/darwin/libGLEW.a
else
    # Linux specific libraries and paths to include
    LIBPATH_OPENGL  := -L../../common/lib/linux/$(OS_ARCH) -L/usr/X11R6/lib -lGL -lGLU -lX11 -lXi -lXmu -lglut -lGLEW -lrt
endif

# Debug build flags
ifeq ($(dbg),1)
      CCFLAGS   += -g
      NVCCFLAGS += -g -G
      TARGET := debug
else
      TARGET := release 

endif


# Common includes and paths for CUDA
INCLUDES      := -I$(CUDA_INC_PATH) -I. -I.. -I../../common/inc
LDFLAGS       += $(LIBPATH_OPENGL)

# Target rules
all: build

build: stackOverflow

stackOverflow.o: stackOverflow.cu
    $(NVCC)  $(NVCCFLAGS) $(EXTRA_NVCCFLAGS) $(GENCODE_FLAGS) $(INCLUDES)  -o $@ -c $<

stackOverflow: stackOverflow.o
    $(GCC) $(CCFLAGS) -o $@ $+ $(LDFLAGS) $(EXTRA_LDFLAGS)
    mkdir -p ./bin/$(OSLOWER)/$(TARGET)
    cp $@ ./bin/$(OSLOWER)/$(TARGET)

run: build
    ./stackOverflow

clean:
    rm -f stackOverflow.o stackOverflow *.pgm

CUDA 5.0 on a Tesla C1060, Ubuntu 12.04.

user123892
  • Is it the kernel execution time that makes the difference? – AdelNick Jan 22 '14 at 10:37
  • When you only want to get the kernel time, why do you measure over your complete code? `cudaEventRecord(start, 0);` is the first thing you do in your main, so the interval up to `cudaEventRecord(stop, 0);` includes everything: allocation on host and GPU, and the memory copies between host and GPU (a minimal sketch of timing only the kernel is shown after these comments). – hubs Jan 22 '14 at 10:43
  • It can only be the kernel execution time. Anyway, I did a run timing only the kernel execution and I got: case 1) 62805.9 ms; case 2) 7787.0 ms – user123892 Jan 22 '14 at 10:47
  • Most likely you are encountering a branch-prediction optimisation: http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – Sergey L. Jan 22 '14 at 10:51
  • @Sergey L.: there is no branching logic in the kernel code. What might CUDA try to predict here? – AdelNick Jan 22 '14 at 10:57
  • @user123892: I would try to get rid of `fabs()` and see the results although it would be really weird if it caused your problem. – AdelNick Jan 22 '14 at 10:58
  • Yes, no branches indeed! – user123892 Jan 22 '14 at 10:59
  • @user123892: If you measure time as in your posted code, of course that is wrong if you only want the elapsed time of the kernel! But even if I measure only the kernel, I get completely different timings for the two cases, too. – hubs Jan 22 '14 at 11:03
  • @AdelNick: no, nothing changes... it should be something related to some mysterious runtime optimization and to local memory (spilled to global memory) access. – user123892 Jan 22 '14 at 11:05
  • @hubs: sorry, I did not tag you but I already replied to your point. – user123892 Jan 22 '14 at 11:07
  • In the second method `srcArr_h` is only filled with `[1,0,1,0...]`. In the first it's filled entirely with random values. So in the second case the effective memory bandwidth increases significantly. Not sure if this can be the only reason. – hubs Jan 22 '14 at 11:12
  • @user123892: Did you read the answer and my comment? It seems that `c_loc` resides in the L1 cache or in global memory. In the first case you pick values from it randomly, and that decreases the kernel performance. – hubs Jan 22 '14 at 11:27
  • @hubs: yes, sorry, I get your point now, memory bandwidth inside the kernel to retrieve/update c_loc[ic]. – user123892 Jan 22 '14 at 11:28
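
For reference, a minimal sketch of how the kernel alone could be timed, as the comments above discuss; startK, stopK and kernel_time are illustrative names that are not in the original code:

    cudaEvent_t startK, stopK;
    cudaEventCreate(&startK);
    cudaEventCreate(&stopK);

    cudaEventRecord(startK, 0);
    gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d, auxD, auxDPitch, widthSrc, c_glob_d);
    cudaEventRecord(stopK, 0);
    cudaEventSynchronize(stopK);   // wait until the kernel has finished

    float kernel_time;
    cudaEventElapsedTime(&kernel_time, startK, stopK);
    printf("Timing KERNEL only: %3.1f ms\n", kernel_time);

    cudaEventDestroy(startK);
    cudaEventDestroy(stopK);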

1 Answer


The Tesla C1060 GPU has compute capability 1.3, which means that every thread can use at most 128 32-bit registers. That is obviously not enough to fit all your local variables (two arrays of 256 floats each, plus some more variables). Since the access to local memory in the following line

c_loc[ic] +=  1.0f;

is spread over the whole range 0...255 in case (1), you are probably observing register spilling, which means that your data is placed into local memory. Local memory is, in fact, located in global memory and therefore has the same throughput. The accesses can be cached, but due to the randomness in your algorithm I bet caching is not very efficient. (EDIT: on compute capability 1.3 it is not even cached; it is simply non-coalesced memory access.) A good presentation about local memory in CUDA and register spilling can be found here. There you can also find some guidance on how to detect and solve the register-spilling problem.

Consider reducing the amount of local data used by each thread, or using shared memory, which is located on the chip and is therefore much faster.
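
A minimal sketch of what a shared-memory variant could look like (gpuKernelShared is an illustrative name, and it keeps one histogram per block instead of one per thread, so it shows the technique rather than being a drop-in replacement for your kernel):

    __global__ void gpuKernelShared
    (
        const float *sa, float *aux,
        size_t memPitchAux, int w
    )
    {
        // One histogram per *block*, kept in on-chip shared memory instead of
        // one per-thread array that gets spilled to local memory.
        __shared__ unsigned int hist[256];

        int tidx = blockIdx.x * blockDim.x + threadIdx.x;
        int tidy = blockIdx.y * blockDim.y + threadIdx.y;
        int tInBlock = threadIdx.y * blockDim.x + threadIdx.x;
        bool inside = (tidx < w) && (tidy < w); // the grid is rounded up, so guard the edges

        // Cooperative initialization of the shared histogram (16x16 = 256 threads per block).
        if (tInBlock < 256)
            hist[tInBlock] = 0;
        __syncthreads();

        // Each thread deposits its pixel value (assumed to lie in [0,255], as in the
        // original code). Shared-memory integer atomics are available from compute
        // capability 1.2 on, so they work on a C1060.
        int ic = 0;
        if (inside)
        {
            ic = (int)fabsf(sa[tidy * w + tidx]);
            atomicAdd(&hist[ic], 1u);
        }
        __syncthreads();

        // Example output: each thread writes the block-wide count of its own bin.
        if (inside)
        {
            int idx = tidy * (memPitchAux / 4) + tidx;
            aux[idx] = (float)hist[ic];
        }
    }

Compiling with `--ptxas-options=-v` (see the comments below) shows the register and local-memory usage per kernel, which makes the spilling visible.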

AdelNick
  • I've profiled the kernel and think the same. In the second case `ic` alternates only between 0 and 1, and I get a load/store ratio of ~2. But in the first case, where `ic` is a random value, the load/store ratio drops to ~15.5. – hubs Jan 22 '14 at 11:15
  • Uh yes, this sounds like the issue! Any suggestions on how I can fix it? – user123892 Jan 22 '14 at 11:31
  • @user123892, it depends on the nature of your problem. Try to make the memory access pattern more uniform and predictable. Some thoughts off the top of my head: sort the array before operating on it, reduce its dimensionality and run more threads, maybe even change your algorithm. I'm not sure if all of that is possible in your case. Hard to say without digging deeper into your problem. – AdelNick Jan 22 '14 at 11:34
  • @AdelNick: I forgot to point it out, but I know that my c_loc[] resides in global memory due to register spilling... Do you have any suggestions on how to solve this issue by coalescing all accesses? – user123892 Jan 22 '14 at 11:36
  • @user123892: I know how to solve it by coalescing the access but I don't know how to make the access coalesced in your particular case ;) – AdelNick Jan 22 '14 at 11:39
  • @AdelNick & hubs thank you, you have been very helpful. Now I'll dedicate myself to make accesses coalesced :-) – user123892 Jan 22 '14 at 14:35
  • @AdelNick & hubs: just to be sure I get it, in case 2) `ic` has just 2 values while in case 1) 256 values. If I get it correct, each thread will access in 2) 256 times only two memory locations (`c_loc[0]` and `c_loc[1]`) while in 1) each thread will access 256 times the 256 different elements of `c_loc[]`. You say 2) is not affected by register spilling. Why is that? I Initialize `c_loc[]` and `sc_loc[]` at kernel beginning, so in both cases array elements allocation to registers or global memory should be the same. Is 2) faster because `c_loc[0]` and `c_loc[1]` are cached? – user123892 Jan 22 '14 at 16:09
  • .. how can global memory accesses be cached if I have a compute capability 1.3 device? As far as I know, caching was introduced with compute capability 2.x devices. – user123892 Jan 29 '14 at 16:27
  • @user123892: true, I forgot about your concrete device when I was writing the answer. See the edits :) Btw, I can't see how the index `is` is used in the loop we are talking about. Is it always the same element of `sa` that is being evaluated? – AdelNick Jan 29 '14 at 16:58
  • @AdelNick: I do not understand how non-coalesced accesses can affect timings if only the values of `sa` are changing and not the addressing (the source array in both cases is built up in the same way, just the filled numbers are different). Sorry. Right, for each thread, the loop is pointing always to the same element, but in my actual code I'm selecting the elements residing close to the element pointed to by the current thread (`sa[(tidy + window_y)*w +tidx + window_x]`). I get the same timing problem. In the code posted here, I just wanted to make the code easier to read. – user123892 Jan 29 '14 at 17:44
  • @user123892: the non-coalesced access in your case doesn't lead to such a big difference. It slows down your kernel, but in both cases. The source of the difference is register spilling, IMHO. Use the `--ptxas-options=-v` compiler option to see the information about register usage. Btw, perhaps you can make the access coalesced by declaring your arrays in global memory and using different indexing, so that threads from the same warp access adjacent memory addresses (see the indexing sketch after these comments). – AdelNick Jan 30 '14 at 08:44
  • @AdelNick: as you also suggest, coalescing is not responsible for the big difference in timings. OK, so in a few words: in case 2) I reuse data already stored in the registers much more often, which results in less register spilling. Thank you for your help. – user123892 Jan 30 '14 at 10:19
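
A sketch of the counter layout described in the last comments (binCounter and totalThreads are illustrative names, not from the post); the only point is how per-thread counters can be arranged in global memory:

    // Thread-major layout (what a per-thread local array effectively becomes
    // once it is spilled to local memory):
    //   counter(thread t, bin b) lives at c_glob[t * 256 + b]
    //   -> the 32 threads of a warp touch addresses 256 floats apart: not coalesced.
    //
    // Bin-major layout (the indexing suggested above):
    //   counter(thread t, bin b) lives at c_glob[b * totalThreads + t]
    //   -> the 32 threads of a warp touch 32 consecutive floats: coalesced.
    //
    // Note that 256 counters per thread only fit in device memory for a limited
    // number of threads, so one histogram per warp or per block may be needed instead.

    __device__ __forceinline__ float* binCounter(float *c_glob, int totalThreads,
                                                 int t, int b)
    {
        return &c_glob[b * totalThreads + t]; // bin-major: coalesced within a warp
    }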