
I'm working with CUDA 5.0 on a Tesla C1060 GPU (compute capability 1.3) under Ubuntu 12.04. In my kernel, each thread computes the values of a (private) local float array locArr[]. The value of a float variable var is then computed from locArr[].

__global__ void gpuKernel
(
    float *src, float *out,
    size_t memPitchAux, int w
)
{
    float locArr[256];
    float var = 0.0f;

    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx  = tidy * memPitchAux/4 + tidx;

    for(int ic=0; ic<256; ic++)
    {
        locArr[ic] = 0.0f;
    }

    for(int is=0; is<255; is++)
    {
        int ic = fabs(src[tidy*w + tidx]);
        locArr[ic] += 1.0f;
    }

    for(int ic=0; ic<255; ic++)
    {
        var += locArr[ic];
    }

    out[idx] = var;
}

There are not enough registers per thread, so locArr[] is spilled to local memory, which physically resides in off-chip global memory. The kernel takes ~18 ms to execute, but if I multiply locArr[] by zero in the final summation loop, the timing drops to ~1.1 ms:

for(int ic=0; ic<255; ic++)
{
    var += locArr[ic] * 0.0f;
}

I do not understand why: each thread should still read the needed locArr[] value from global memory and then multiply it by zero, so the timing should be the same. Instead, it is as if the threads already know the read is unnecessary because the output will be zero anyway.

Could anyone explain to me what is going on?

EDIT: if instead I have

for(int ic=0; ic<255; ic++)
{
    var += locArr[ic] * locArr2[ic];
}

where locArr2[] is another local array of zeros (also spilled to global memory), can the optimization still be done, given that the zero values are only known at run time?
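
To make this concrete, here is a minimal sketch of the variant I have in mind (src2 is a hypothetical second input buffer that happens to contain only zeros at run time; sizes and indexing mirror the kernel above):

__global__ void gpuKernel2
(
    float *src, float *src2, float *out,
    size_t memPitchAux, int w
)
{
    float locArr[256];    // spilled to local (off-chip) memory
    float locArr2[256];   // also spilled; ends up holding only zeros
    float var = 0.0f;

    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx  = tidy * memPitchAux/4 + tidx;

    for(int ic=0; ic<256; ic++)
    {
        locArr[ic]  = 0.0f;
        locArr2[ic] = src2[tidy*w + tidx];   // zero at run time, but the compiler cannot know that
    }

    for(int is=0; is<255; is++)
    {
        int ic = fabs(src[tidy*w + tidx]);
        locArr[ic] += 1.0f;
    }

    for(int ic=0; ic<255; ic++)
    {
        var += locArr[ic] * locArr2[ic];
    }

    out[idx] = var;
}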

EDIT 2: my makefile

################################################################################
#
# Makefile project only supported on Mac OSX and Linux Platforms
#
################################################################################

# OS Name (Linux or Darwin)
OSUPPER = $(shell uname -s 2>/dev/null | tr [:lower:] [:upper:])
OSLOWER = $(shell uname -s 2>/dev/null | tr [:upper:] [:lower:])

# Flags to detect 32-bit or 64-bit OS platform
OS_SIZE = $(shell uname -m | sed -e "s/i.86/32/" -e "s/x86_64/64/")
OS_ARCH = $(shell uname -m | sed -e "s/i386/i686/")

# These flags will override any settings
ifeq ($(i386),1)
    OS_SIZE = 32
    OS_ARCH = i686
endif

ifeq ($(x86_64),1)
    OS_SIZE = 64
    OS_ARCH = x86_64
endif

# Flags to detect either a Linux system (linux) or Mac OSX (darwin)
DARWIN = $(strip $(findstring DARWIN, $(OSUPPER)))

# Location of the CUDA Toolkit binaries and libraries
CUDA_PATH       ?= /usr/local/cuda-5.0
CUDA_INC_PATH   ?= $(CUDA_PATH)/include
CUDA_BIN_PATH   ?= $(CUDA_PATH)/bin
ifneq ($(DARWIN),)
  CUDA_LIB_PATH  ?= $(CUDA_PATH)/lib
else
  ifeq ($(OS_SIZE),32)
    CUDA_LIB_PATH  ?= $(CUDA_PATH)/lib
  else
    CUDA_LIB_PATH  ?= $(CUDA_PATH)/lib64
  endif
endif

# Common binaries
NVCC            ?= $(CUDA_BIN_PATH)/nvcc
GCC             ?= g++

# Extra user flags
EXTRA_NVCCFLAGS ?=
EXTRA_LDFLAGS   ?=
EXTRA_CCFLAGS   ?=

# CUDA code generation flags
GENCODE_SM10    := -gencode arch=compute_10,code=sm_10
GENCODE_SM20    := -gencode arch=compute_20,code=sm_20
GENCODE_SM30    := -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35
GENCODE_FLAGS   := $(GENCODE_SM10) $(GENCODE_SM20) $(GENCODE_SM30)

# OS-specific build flags
ifneq ($(DARWIN),) 
      LDFLAGS   := -Xlinker -rpath $(CUDA_LIB_PATH) -L$(CUDA_LIB_PATH) -lcudart
      CCFLAGS   := -arch $(OS_ARCH) 
else
  ifeq ($(OS_SIZE),32)
      LDFLAGS   := -L$(CUDA_LIB_PATH) -lcudart
      CCFLAGS   := -m32
  else
      LDFLAGS   := -L$(CUDA_LIB_PATH) -lcudart
      CCFLAGS   := -m64
  endif
endif

# OS-architecture specific flags
ifeq ($(OS_SIZE),32)
      NVCCFLAGS := -m32
else
      NVCCFLAGS := -m64
endif

# OpenGL specific libraries 
ifneq ($(DARWIN),)
    # Mac OSX specific libraries and paths to include
    LIBPATH_OPENGL  := -L../../common/lib/darwin -L/System/Library/Frameworks/OpenGL.framework/Libraries -framework GLUT -lGL -lGLU ../../common/lib/darwin/libGLEW.a
else
    # Linux specific libraries and paths to include
    LIBPATH_OPENGL  := -L../../common/lib/linux/$(OS_ARCH) -L/usr/X11R6/lib -lGL -lGLU -lX11 -lXi -lXmu -lglut -lGLEW -lrt
endif

# Debug build flags
ifeq ($(dbg),1)
      CCFLAGS   += -g
      NVCCFLAGS += -g -G
      TARGET := debug
else
      TARGET := release
endif


# Common includes and paths for CUDA
INCLUDES      := -I$(CUDA_INC_PATH) -I. -I.. -I../../common/inc
LDFLAGS       += $(LIBPATH_OPENGL)

# Target rules
all: build

build: stackOverflow

stackOverflow.o: stackOverflow.cu
    $(NVCC)  $(NVCCFLAGS) $(EXTRA_NVCCFLAGS) $(GENCODE_FLAGS) $(INCLUDES)  -o $@ -c $<

stackOverflow: stackOverflow.o
    $(GCC) $(CCFLAGS) -o $@ $+ $(LDFLAGS) $(EXTRA_LDFLAGS)
    mkdir -p ./bin/$(OSLOWER)/$(TARGET)
    cp $@ ./bin/$(OSLOWER)/$(TARGET)

run: build
    ./stackOverflow

clean:
    rm -f stackOverflow.o stackOverflow *.pgm
user123892
  • This sounds just like what you'd expect from an optimizing compiler, to me. – unwind Jan 21 '14 at 13:17
  • Just a note: your second for-loop uses a counter called "is", but "ic" is used inside it: is that intended behaviour? – Andrea Jan 21 '14 at 13:23
  • @Andrea: yes, because the code I show is just a portion of my real code. – user123892 Jan 21 '14 at 13:36
  • How are you compiling your code? – Vitality Jan 21 '14 at 14:10
  • with the makefile I copied above (EDIT 2). – user123892 Jan 21 '14 at 14:18
  • @unwind this is almost certainly due to compiler optimization. Do you want to provide a short answer? There are many questions on SO that deal with this topic and how to confirm it, such as [this one](http://stackoverflow.com/questions/17816202/cuda-faster-passing-values/17816445#17816445). – Robert Crovella Jan 21 '14 at 14:25

1 Answer


unwind's hypothesis may be right.

I have disassembled your original __global__ function and the version with the multiplication by 0.f. To keep the disassembled code short, I changed 256 to 4. As you will see, the FADD instructions of the original code become FFMA instructions (fused multiply-adds against RZ, the always-zero register), which could be faster and could be the reason for the speedup you are observing.
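
For reference, the listings below correspond roughly to this reduced kernel (a sketch of the code in the question with the array size and loop trip counts cut down to 4; the exact bounds do not matter for the comparison):

__global__ void gpuKernel
(
    float *src, float *out,
    size_t memPitchAux, int w
)
{
    float locArr[4];      // 256 shrunk to 4 to keep the SASS listings short
    float var = 0.0f;

    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx  = tidy * memPitchAux/4 + tidx;

    for(int ic=0; ic<4; ic++)
    {
        locArr[ic] = 0.0f;
    }

    for(int is=0; is<4; is++)
    {
        int ic = fabs(src[tidy*w + tidx]);
        locArr[ic] += 1.0f;
    }

    for(int ic=0; ic<4; ic++)
    {
        var += locArr[ic];    // second version: var += locArr[ic] * 0.0f;
    }

    out[idx] = var;
}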

ORIGINAL VERSION

MOV R1, c[0x1][0x100];
NOP;
ISUB R1, R1, 0x10;
S2R R0, SR_CTAID.Y;
S2R R2, SR_TID.Y;
IMAD R2, R0, c[0x0][0xc], R2;
S2R R3, SR_CTAID.X;
S2R R4, SR_TID.X;
IMAD R3, R3, c[0x0][0x8], R4;
MOV32I R0, 0x4;
IMAD R4, R2, c[0x0][0x38], R3;
IMAD R6.CC, R4, R0, c[0x0][0x20];
IMAD.HI.X R7, R4, R0, c[0x0][0x24];
IMUL.U32.U32 R11.CC, R2, c[0x0][0x30];
LD.E R4, [R6];
STL.64 [R1], RZ;
SHR.U32 R11, R11, 0x2;
STL.64 [R1+0x8], RZ;
IMAD.U32.U32.HI.X R5.CC, R2, c[0x0][0x30], RZ;
F2I.S32.F32.TRUNC R4, |R4|;
ISCADD R9, R4, R1, 0x2;
LDL R4, [R9];
FADD R4, R4, 1;
FADD R4, R4, 1;
FADD R4, R4, 1;
FADD R8, R4, 1;
BFE R4, R2, 0x11f;
STL [R9], R8;
IMAD.U32.U32.X R10, R4, c[0x0][0x30], R5;
LDL R7, [R1];
IMAD.U32.U32 R10, R2, c[0x0][0x34], R10;
LDL R6, [R1+0x4];
ISCADD R8, R10, R11, 0x1e;
LDL.64 R4, [R1+0x8];
IADD R3, R8, R3;
F2F.F32.F32 R2, R7;
FADD R2, R2, R6;
IMAD.U32.U32 R6.CC, R3, R0, c[0x0][0x28];
FADD R2, R2, R4;
IMAD.HI.X R7, R3, R0, c[0x0][0x2c];
FADD R0, R2, R5;
ST.E [R6], R0;
EXIT ;

MULTIPLICATION BY 0

MOV R1, c[0x1][0x100];
NOP;
ISUB R1, R1, 0x10;
S2R R0, SR_CTAID.Y;
S2R R2, SR_TID.Y;
IMAD R2, R0, c[0x0][0xc], R2;
S2R R3, SR_CTAID.X;
S2R R4, SR_TID.X;
IMAD R3, R3, c[0x0][0x8], R4;
MOV32I R0, 0x4;
IMAD R4, R2, c[0x0][0x38], R3;
IMAD R6.CC, R4, R0, c[0x0][0x20];
IMAD.HI.X R7, R4, R0, c[0x0][0x24];
IMUL.U32.U32 R11.CC, R2, c[0x0][0x30];
LD.E R4, [R6];
STL.64 [R1], RZ;
SHR.U32 R11, R11, 0x2;
STL.64 [R1+0x8], RZ;
IMAD.U32.U32.HI.X R5.CC, R2, c[0x0][0x30], RZ;
F2I.S32.F32.TRUNC R4, |R4|;
ISCADD R9, R4, R1, 0x2;
LDL R4, [R9];
FADD R4, R4, 1;
FADD R4, R4, 1;
FADD R4, R4, 1;
FADD R8, R4, 1;
BFE R4, R2, 0x11f;
STL [R9], R8;
IMAD.U32.U32.X R10, R4, c[0x0][0x30], R5;
LDL R6, [R1];
IMAD.U32.U32 R10, R2, c[0x0][0x34], R10;
LDL R7, [R1+0x4];
ISCADD R8, R10, R11, 0x1e;
LDL.64 R4, [R1+0x8];
IADD R3, R8, R3;
FFMA R2, R6, RZ, RZ;
FFMA R2, R7, RZ, R2;
IMAD.U32.U32 R6.CC, R3, R0, c[0x0][0x28];
FFMA R2, R4, RZ, R2;
IMAD.HI.X R7, R3, R0, c[0x0][0x2c];
FFMA R0, R5, RZ, R2;
ST.E [R6], R0;
EXIT ;
Vitality
  • Thank you, this is now clear. Actually my real problem involves array multiplication (as in the EDIT). If I run a similar kernel with different src[] arrays, I get different timings. So I thought that maybe something new to me was happening in the data addressing, but that is a new question. Could you tell me how you disassembled the code? – user123892 Jan 21 '14 at 17:06
  • `cuobjdump your_cubin_file.cubin --dump-sass`. Concerning timing, how are you measuring the timings? How different are they? Small differences in timing can be normal. Be sure to consider an average of the timings taken over a "statistically relevant" number of launches. – Vitality Jan 21 '14 at 18:39
  • thank you. I posted a new question about the different timings [here](http://stackoverflow.com/questions/21280276/why-does-timing-drastically-changes-with-the-amount-of-zeros-in-the-input-data) – user123892 Jan 22 '14 at 10:30