
I've compared processing times with theano(CPU), theano(GPU) and Scikit-learn(CPU) using Python, but I got a strange result. Here is the graph that I plotted.

Processing Time Comparison:

[plot: Processing Time Comparison]

You can see that scikit-learn is faster than theano(GPU). The program whose elapsed time I measured computes a Euclidean distance matrix from a matrix with n * 40 elements.

Here is the relevant part of the code.

import numpy as np
import theano
import theano.tensor as T
from sklearn import metrics as sk   # sk refers to sklearn.metrics

points = T.fmatrix("points")
edm = T.zeros_like(points)

def get_point_to_points_euclidean_distances(point_id):
    # distances from row `point_id` to every row of `points`
    euclideans = T.sqrt(T.sqr(points - points[point_id, :]).sum(axis=1))
    return euclideans

def get_EDM_CPU(points):
    # plain NumPy: one row of the EDM per loop iteration
    EDM = np.zeros((points.shape[0], points.shape[0])).astype(np.float32)
    for row in range(points.shape[0]):
        EDM[row, :] = np.sqrt(np.sum((points - points[row, :])**2, axis=1))
    return EDM

def get_sk(points):
    EDM = sk.pairwise_distances(points, metric='l2')
    return EDM

seq = T.arange(T.shape(points)[0])
(result, _) = theano.scan(fn=get_point_to_points_euclidean_distances,
                          outputs_info=None,
                          sequences=seq)

get_EDM_GPU = theano.function(inputs=[points], outputs=result, allow_input_downcast=True)
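
A minimal sketch of the kind of wall-clock harness that produces the timings below (assuming the three functions above and a (10000, 40) float32 input; the exact script is not shown here):

import time
import numpy as np

# timing sketch -- assumes get_EDM_GPU, get_EDM_CPU and get_sk defined above
data = np.random.rand(10000, 40).astype(np.float32)
print("data shape : ", data.shape)

for name, fn in [("get_EDM_GPU", get_EDM_GPU),
                 ("get_EDM_CPU", get_EDM_CPU),
                 ("get_EDM_sk",  get_sk)]:
    start = time.time()
    fn(data)
    print(name, "elapsed time : ", time.time() - start, "(s)")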

I thought that the reason the GPU is slower than scikit-learn was probably the transfer time, so I profiled the GPU run with the nvprof command. Then I got this:

==27105== NVPROF is profiling process 27105, command: python ./EDM_test.py
Using gpu device 0: GeForce GTX 580 (CNMeM is disabled, cuDNN not available)
data shape :  (10000, 40)
get_EDM_GPU elapsed time :  1.84863090515 (s)
get_EDM_CPU elapsed time :  8.09937691689 (s)
get_EDM_sk elapsed time :  1.10968112946 (s)
ratio :  4.38128395145
==27105== Profiling application: python ./EDM_test.py
==27105== Warning: Found 9 invalid records in the result.
==27105== Warning: This could be because device ran out of memory when profiling.
==27105== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 71.34%  1.28028s      9998  128.05us  127.65us  128.78us  kernel_reduce_01_node_316e2e1cbfbe8cfb8e4a101f329ffeec_0(int, int, float const *, int, int, float*, int)
 19.95%  357.97ms      9997  35.807us  35.068us  36.948us  kernel_Sub_node_bc41b3f8f12c93d29f2c4360ad445d80_0_2(unsigned int, int, int, float const *, int, int, float const *, int, int, float*, int, int)
  7.32%  131.38ms         2  65.690ms  1.2480us  131.38ms  [CUDA memcpy DtoH]
  1.25%  22.456ms      9996  2.2460us  2.1140us  2.8420us  kernel_Sqrt_node_23508f8f49d12f3e8369d543f5620c15_0_Ccontiguous(unsigned int, float const *, float*)
  0.12%  2.1847ms         1  2.1847ms  2.1847ms  2.1847ms  [CUDA memset]
  0.01%  259.73us         5  51.946us     640ns  250.36us  [CUDA memcpy HtoD]
  0.00%  17.086us         1  17.086us  17.086us  17.086us  kernel_reduce_ccontig_node_97496c4d3cf9a06dc4082cc141f918d2_0(unsigned int, float const *, float*)
  0.00%  2.0090us         1  2.0090us  2.0090us  2.0090us  void copy_kernel<float, int=0>(cublasCopyParams<float>)

The transfer [CUDA memcpy DtoH] was performed twice { 1.248 [us], 131.38 [ms] }

The transfer [CUDA memcpy HtoD] was performed 5x { min: 640 [ns], max: 250.36 [us] }

The transfer time is about 131.64 ms (131.38 ms + 259.73 us), but the gap between the GPU and scikit-learn is about 700 ms (1.8 s - 1.1 s). So the gap is larger than the transfer time.

Does scikit-learn compute only the upper triangular part of the symmetric matrix?

What makes scikit-learn so fast?
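
(For comparison only, I have not benchmarked this: as far as I know, scipy's pdist returns just the condensed upper triangle and squareform expands it afterwards, which would roughly halve the distance computations.)

import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.random.rand(10000, 40).astype(np.float32)
condensed = pdist(pts, metric='euclidean')   # only the n*(n-1)/2 upper-triangular distances
EDM = squareform(condensed)                  # expanded to the full symmetric (n, n) matrix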

  • What is CPU and GPU on the first graph? – Worthy7 Sep 05 '17 at 05:18
  • @Worthy7 CPU means that it computes the euclidean distance matrix using a for-loop, and GPU means that it computes the matrix using the theano library on the GPU. – Holden Sep 05 '17 at 07:44
  • No, but CPU is a piece of hardware and SKLEARN is a python framework. You can't just put those two on a graph. Anyway, I think you just mean running pure python vs using sklearn, right? sklearn is optimized internally - it's that simple. Try a much larger set of data please :) Let's say 100X bigger – Worthy7 Sep 06 '17 at 04:46
  • @Worthy7 Theano has both CPU and GPU backend. He's talking about using that. – Kh40tiK Sep 11 '17 at 22:51
  • Fair enough. Larger test! – Worthy7 Sep 12 '17 at 00:48

1 Answer


What makes scikit-learn ( on the pure CPU side ) so fast?

My initial candidates would be a mix of:

  • highly efficient use of the available CPU cores' L1/L2 caches, whose access latencies sit in the fastest few-[ns] range
  • smart numpy vectorised execution that is friendly to CPU cache-lines
  • a dataset small enough to remain entirely resident in cache without being evicted ( test by scaling the dataset-under-review well above the L2/L3 cache sizes to see the DDRx-memory-cost effects on the observed performance; details are in the URL below )
  • numpy might enjoy even better timing if the .astype() conversions are avoided ( test it ) -- see the vectorised sketch right after this list
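
A rough sketch of such a fully vectorised, single-dtype route ( my illustration only, not scikit-learn's actual implementation ):

import numpy as np

def edm_vectorised(points):
    # points: (n, d) float32 array; stays in one dtype, no extra .astype() round-trips
    sq = np.einsum('ij,ij->i', points, points)          # row-wise squared norms
    # ||a - b||^2 = a.a - 2 a.b + b.b, evaluated for all pairs via one matrix product
    d2 = sq[:, None] - 2.0 * points.dot(points.T) + sq[None, :]
    np.maximum(d2, 0.0, out=d2)                         # clip tiny negatives from rounding
    return np.sqrt(d2, out=d2)

pts = np.random.rand(10000, 40).astype(np.float32)
EDM = edm_vectorised(pts)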

Facts on the GPU-side

  • auto-generated GPU kernels do not have much chance to reach the ultimate levels of global-memory latency-masking, compared to manually tweaked kernel designs, tailor-fitted to the respective GPU silicon architecture / latencies observed in-vivo
  • data structures larger than just a few KB keep paying GPU-SM/GDDR-MEM distances of ~ large hundreds of [ns], nearly [us], versus the small tens of [ns] at CPU/L1/L2/L3/DDRx ( ref. timing details in >>> https://stackoverflow.com/a/33065382 )
  • there is no way to enjoy much of the GPU/SMX power, due to this task's obvious low re-use of data points and a dataset size beyond the GPU/SM silicon limits, which causes and must cause GPU/SM register-capacity spillovers in any kind of GPU-kernel design attempt and tweaking
  • the global task does not have a minimum reasonable amount of asynchronous, isolated ( non-communicating islands ), mathematically dense yet SMX-local GPU-kernel processing steps ( there is not much to compute, so there is nothing to amortise the add-on overheads and the expensive SMX/GDDR memory costs ) -- see the GEMM-style sketch right after this list
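
To illustrate the data re-use point: the scan-based formulation in the question launches one tiny reduce kernel plus one sub kernel per row ( ~10,000 launches, visible in the nvprof table above ), whereas a single T.dot hands the GPU one dense matrix product to chew on. A sketch of such a reformulation, assuming the same Theano GPU setup as in the question ( untested here ):

import theano
import theano.tensor as T

points = T.fmatrix("points")
sq = T.sqr(points).sum(axis=1)                      # row-wise squared norms
# ||a - b||^2 = a.a - 2 a.b + b.b  ->  one large, dense GEMM instead of ~10,000 tiny kernels
d2 = sq.dimshuffle(0, 'x') - 2.0 * T.dot(points, points.T) + sq.dimshuffle('x', 0)
edm = T.sqrt(T.maximum(d2, 0.0))                    # clip rounding noise before the sqrt

get_EDM_GPU_gemm = theano.function([points], edm, allow_input_downcast=True)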

GPUs can exhibit their best performance when sufficiently dense re-processing operations take place -- as in large-scale / high-resolution image processing -- on [m,n,o]-convolution-kernel matrices so small that all of the m*n*o constant values can stay local to the SM, inside the available set of SMX SM-registers, and when the GPU-kernel launchers are optimally tweaked by the 3D-tblock/grid processing-layout geometries, so that the global-memory access latencies are masked as well as possible and all GPU threads are kept within the hardware WARP-aligned SMx:WarpScheduler Round-Robin thread-scheduling capabilities ( the first swap from Round-Robin into Greedy-WarpSchedule mode loses the whole battle in case of divergent execution-paths in the GPU-kernel code ).
