
I've tested an example demonstrated in this talk [pytables] using numpy (page 20/57).

It is stated that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.

I tried to reproduce it, but failed to do so. Am I measuring wrongly? Or have things changed in NumPy since 2010?

$ python2 -m timeit -n1000 --setup \ 
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()' 
1000 loops, best of 3: 16.5 usec per loop

$ python2 -m timeit -n1000 --setup \ 
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()' 
1000 loops, best of 3: 13.8 usec per loop

$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1

While I can measure a benefit for the second version (supposedly fewer cache misses, because numpy uses C-style row-major ordering), I don't see the drastic difference stated by the pytables contributor.
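The row-major layout itself is easy to verify. Here is a small sketch (added for illustration; it assumes the default C-contiguous float64 array returned by np.random.randn):

import numpy as np

a = np.random.randn(4000, 4000)
print a.flags['C_CONTIGUOUS']   # True: each row is stored contiguously in memory
print a.strides                 # (32000, 8): one step along a row moves 8 bytes,
                                # one step down a column jumps 32000 bytes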

Also, I cannot see more cache misses for column summation compared to row summation.


EDIT

  • So far the insight for me was that I was using the timeit module in the wrong way. Repeated runs with the same array (or row/column of an array) will almost certainly be served from cache (I've got 32 KiB of L1 data cache, so a 4000-element slice just fits: 4000 * 8 byte = 31.25 KiB < 32 KiB); see the quick check sketched after this list.

  • Using the script from the answer of @ali_m with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random array (n x n), I am measuring

     n   row/us col/us    penalty (col/row)
     1k     90    100         1
     4k    100    210         2
    10k*   110    350         3.5
    20k*   120   1200        10
    

    * for n=10k and higher, a row/column slice doesn't fit into the L1d cache anymore.
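A quick back-of-the-envelope check of the cache-fit claim above (a small sketch; the 32 KiB L1d size is the value lscpu reports on my machine):

# one row/column slice of an n x n float64 array occupies n * 8 bytes
L1D = 32 * 1024  # L1d cache size in bytes
for n in (1000, 4000, 10000, 20000):
    slice_bytes = n * 8
    print "n=%5d: slice = %6.2f KiB, fits in L1d: %s" % (
        n, slice_bytes / 1024.0, slice_bytes <= L1D)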

I'm still not sure how to trace down the cause of this, as perf shows about the same rate of cache misses (sometimes even a higher rate) for the faster row sum.

Perf data:

nloop=2 and nrep=2, so I expect some of the data to still be in the cache for the second run.

Row sum n=10k

 perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/    /g'
row sum:    103.593 us
 Performance counter stats for './answer1.py':
          25850670      cache-references                                             [30.04%]
           1321945      cache-misses              #    5.114 % of all cache refs     [20.04%]
        5706371393      L1-dcache-loads                                              [20.00%]
          11733777      L1-dcache-load-misses     #    0.21% of all L1-dcache hits   [19.97%]
        2401264190      L1-dcache-stores                                             [20.04%]
         131964213      L1-dcache-store-misses                                       [20.03%]
           2007640      L1-dcache-prefetches                                         [20.04%]
       21894150686      cycles                    [20.02%]
       24582770606      instructions              #    1.12  insns per cycle         [30.06%]
        3534308182      branches                                                     [30.01%]
              3767      faults
                 6      migrations
       7.331092823 seconds time elapsed

Column sum n=10k

 perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/    /g'
column sum: 377.059 us
 Performance counter stats for './answer1.py':
          26673628      cache-references                                             [30.02%]
           1409989      cache-misses              #    5.286 % of all cache refs     [20.07%]
        5676222625      L1-dcache-loads                                              [20.06%]
          11050999      L1-dcache-load-misses     #    0.19% of all L1-dcache hits   [19.99%]
        2405281776      L1-dcache-stores                                             [20.01%]
         126425747      L1-dcache-store-misses                                       [20.02%]
           2128076      L1-dcache-prefetches                                         [20.04%]
       21876671763      cycles                    [20.00%]
       24607897857      instructions              #    1.12  insns per cycle         [30.00%]
        3536753654      branches                                                     [29.98%]
              3763      faults
                 9      migrations
       7.327833360 seconds time elapsed

EDIT2: I think I have understood some aspects, but the question has not been fully answered yet. At the moment I think this summation example doesn't reveal anything about CPU caches at all. To eliminate uncertainty introduced by numpy/python, I tried using perf on the same summation written in C; the results are in an answer below.

Sebastian
  • I see the full order of magnitude difference both with a 64 bit Windows numpy 1.8.1 , and with a 32 bit Windows numpy master. What hardware are you using? – Jaime Jul 14 '14 at 14:41
  • What are the sizes of the CPU caches on your computer? – Warren Weckesser Jul 14 '14 at 14:48
  • From `lscpu` I'm reading `Architecture: x86_64, CPU(s): 2, Model name: Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz, L1d cache: 32K, L1i cache: 32K, L2 cache: 4096K` – Sebastian Jul 14 '14 at 15:26

4 Answers


I don't see anything wrong with your attempt at replication, but bear in mind that those slides are from 2010, and numpy has changed rather a lot since then. Based on the dates of numpy releases, I would guess that Francesc was probably using v1.5.

Using this script to benchmark row v column sums:

#!python

import numpy as np
import timeit

print "numpy version == " + str(np.__version__)

setup = "import numpy as np; a = np.random.randn(4000, 4000)"

rsum = "a[1, :].sum()"
csum = "a[:, 1].sum()"

nloop = 1000
nrep = 3

print "row sum:\t%.3f us" % (
    min(timeit.repeat(rsum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)
print "column sum:\t%.3f us" % (
    min(timeit.repeat(csum, setup, repeat=nrep, number=nloop)) / nloop * 1E6)

I detect about a 50% slowdown for column sums with numpy v1.5:

$ python sum_benchmark.py 
numpy version == 1.5.0
row sum:        8.472 us
column sum:     12.759 us

Compared with about a 30% slowdown with v1.8.1, which you're using:

$ python sum_benchmark.py 
numpy version == 1.8.1
row sum:        12.108 us
column sum:     15.768 us

It's interesting to note that both types of reduction have actually gotten a bit slower in the more recent numpy versions. I would have to delve a lot deeper into numpy's source code to understand exactly why this is the case.

Update

  • For the record, I'm running Ubuntu 14.04 (kernel v3.13.0-30) on a quad-core i7-2630QM CPU @ 2.0GHz. Both versions of numpy were pip-installed and compiled using GCC-4.8.1.
  • I realize my original benchmarking script wasn't totally self-explanatory - you need to divide the total time by the number of loops (1000) in order to get the time per call.
  • It also probably makes more sense to take the minimum across repeats rather than the average, since this is more likely to represent the lower bound on the execution time (on top of which you'd get variability due to background processes etc.).

I've updated my script and results above accordingly.

We can also negate any effect of caching across calls (temporal locality) by creating a brand-new random array for every call - just set nloop to 1 and nrep to a reasonably small number (unless you really enjoy watching paint dry), say 10.

nloop=1, nreps=10 on a 4000x4000 array:

numpy version == 1.5.0
row sum:        47.922 us
column sum:     103.235 us

numpy version == 1.8.1
row sum:        66.996 us
column sum:     125.885 us

That's a bit more like it, but I still can't really replicate the massive effect that Francesc's slides show. Perhaps this isn't that surprising, though - the effect may be very compiler-, architecture-, and/or kernel-dependent.

ali_m
  • @Sebastian Yeah, I have a feeling that particular run might have been a bit of a fluke (also see my updates). Taking the min rather than the mean, I'm now consistently seeing a similar magnitude of difference to you with 1.8.1. – ali_m Jul 15 '14 at 00:24
  • thanks for the updated script, which results in consistent performance (`number=1`, `repeat=10`): row `~100us`, column `~210us`. So I conclude that there is a factor 2 for the two operations under these circumstances. I cannot, however, trace this back to *cache misses* using the `perf` tool. Both versions have about the same cache misses (~5% of all cache refs). – Sebastian Jul 15 '14 at 07:54
  • I'm seeing about 31% misses for both row and column sums with `nloop=1`, `nreps=10` in numpy 1.8.1. The point of setting the number of loops to 1 is to negate any performance gains due to the array row/column already being cached from the previous call (since it will be a different random array each time), so I was not really expecting to see much difference in cache misses. It's interesting to see that when I go back to `nloop=1000`, `nreps=3` I actually see a slightly *greater* rate of cache misses for the row sum (17% vs 13%), even though it's faster than the column sum. – ali_m Jul 15 '14 at 08:29
  • Right, I see the same trend: a *higher rate* of misses for the row summation (`4.3%` vs `3.5%` for n=4000), but faster summation. PS: I've updated my question with new measurements. – Sebastian Jul 15 '14 at 12:21

Interesting. I can reproduce Sebastian's performance:

In [21]: np.__version__ 
Out[21]: '1.8.1'

In [22]: a = np.random.randn(4000, 4000)

In [23]: %timeit a[:, 1].sum()
100000 loops, best of 3: 12.4 µs per loop

In [24]: %timeit a[1, :].sum()
100000 loops, best of 3: 10.6 µs per loop

However, if I try with a larger array:

In [25]: a = np.random.randn(10000, 10000)

In [26]: %timeit a[:, 1].sum()
10000 loops, best of 3: 21.8 µs per loop

In [27]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.8 µs per loop

but, if I try again:

In [28]: a = np.random.randn(10000, 10000)

In [29]: %timeit a[:, 1].sum()
10000 loops, best of 3: 64.4 µs per loop

In [30]: %timeit a[1, :].sum()
100000 loops, best of 3: 15.9 µs per loop

So I'm not sure what's going on here, but this jitter is probably due to cache effects. Perhaps newer architectures are wiser in predicting access patterns and hence do better prefetching?

At any rate, and for comparison purposes, I am using NumPy 1.8.1, Linux Ubuntu 14.04 and a laptop with an i5-3380M CPU @ 2.90GHz.

EDIT: After thinking a bit about this, yes, I would say that the first time timeit executes the sum, the column (or the row) is fetched from RAM, but the second time the operation runs, the data is already in cache (for both the row-wise and column-wise versions), so it executes fast. As timeit takes the minimum of the runs, this is why we don't see a big difference in the times.

Another question is why we sometimes do see the difference (using timeit). But caches are weird beasts, especially in multicore machines executing multiple processes at a time.
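Roughly, the effect can be demonstrated like this (a sketch added for illustration; the exact numbers will of course depend on the machine):

import numpy as np
import timeit

a = np.random.randn(10000, 10000)
setup = 'from __main__ import a'

# first call: the column has to be fetched from RAM
print timeit.timeit('a[:, 1].sum()', setup, number=1)
# immediately repeated call: much of the column may still sit in cache
print timeit.timeit('a[:, 1].sum()', setup, number=1)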

Francesc

I wrote the summation example in C. The results below are CPU time measurements, and I always compiled with gcc -O1 using-c.c (gcc version 4.9.0 20140604). The source code is below.

I chose the matrix size to be n x n. For n<2k the row and column summation do not have any measurable difference (6-7 us per run for n=2k).

Row summation

n     first/us      converged/us
 1k       5                 4
 4k      19                12
10k      35                31
20k      70                61
30k     130                90
e.g. n=20k:
Run 0 taken 70 cycles. 0 ms 70 us
Run 1 taken 61 cycles. 0 ms 60 us # this is the minimum I've seen in all tests
Run 1 taken 61 cycles. 0 ms 61 us
<snip> (always 60/61 cycles)

Column summation

n     first/us      converged/us
 1k       5                 4
 4k     112                14
10k     228                32
20k     550               246
30k    1000               300

e.g. n=20k:

Run 0 taken 552 cycles. 0 ms 552 us
Run 1 taken 358 cycles. 0 ms 358 us
Run 2 taken 291 cycles. 0 ms 291 us
Run 3 taken 264 cycles. 0 ms 264 us
Run 4 taken 252 cycles. 0 ms 252 us
Run 5 taken 275 cycles. 0 ms 275 us
Run 6 taken 262 cycles. 0 ms 262 us
Run 7 taken 249 cycles. 0 ms 249 us
Run 8 taken 249 cycles. 0 ms 249 us
Run 9 taken 246 cycles. 0 ms 246 us

Discussion

Row summation is faster. It doesn't benefit much from any caching, i.e. repeated sums are not much faster than the initial sum. Column summation is much slower, but it steadily speeds up over the first 5-8 iterations. The speedup is most pronounced from n=4k to n=10k, where caching increases the speed about tenfold. For larger arrays, the speedup is only about a factor of 2. I also observe that while row summation converges very quickly (after one or two trials), column summation takes many more iterations to converge (5 or more).

Take away lesson for me:

  • For large arrays (a slice of more than 2k elements) there is a difference in summation speed. I believe this is due to how data is fetched from RAM into the L1d cache: although I don't know the block/line size of one read, I assume it is larger than 8 bytes, so the next element to sum up is often already in the cache (a way to check the line size is sketched after this list).
  • Column sum speed is first and foremost limited by memory bandwidth. The CPU seems to starve for data while the spread-out chunks are read from RAM.
  • When performing the summation repeatedly, one expects that some data doesn't need to be fetched from RAM because it is already present in the L2/L1d cache. For row summation, this is noticeable only for n>30k; for column summation it becomes apparent already at n>2k.
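To check the actual cache line size, something like this can be used (a sketch for illustration; the sysfs path is Linux-specific, and index0 usually corresponds to the L1d cache):

# cache line size in bytes; typically 64 on x86_64, i.e. 8 consecutive doubles per line
with open('/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size') as f:
    print f.read().strip()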

Using perf, I don't see a large difference, though. But the bulk of the work in the C program is filling the array with random data, and I don't know how to exclude this "setup" phase from the measurement...

Here is the C code for this example:

#include <stdio.h>
#include <stdlib.h> // see `man random`
#include <time.h> // man  time.h, info clock

int
main (void)
{
  // seed
  srandom(62);
  //printf ("test %g\n", (double)random()/(double)RAND_MAX);
  const size_t SIZE = 20E3;
  const size_t RUNS = 10;
  double (*b)[SIZE];
  printf ("Array size: %dx%d, each %d bytes. slice = %f KiB\n", SIZE, SIZE, 
      sizeof(double), ((double)SIZE)*sizeof(double)/1024);

  b = malloc(sizeof *b * SIZE);

  //double a[SIZE][SIZE]; // too large!
  int i,j;
  for (i = 0; i< SIZE; i++) {
    for (j = 0; j < SIZE; j++) {
      b[i][j] = (double)random()/(double)RAND_MAX;
    }
  }
  double sum = 0;
  int run = 0;
  clock_t start, diff;
  int usec;
  for (run = 0; run < RUNS; run++) {
    start = clock();
    for (i = 0; i<SIZE; i++) {
      // column wise (slower?)
      sum += b[i][1];
      // row wise (faster?)
      //sum += b[1][i];
    }
    diff = clock() - start;
    usec = ((double) diff*1e6) / CLOCKS_PER_SEC; // https://stackoverflow.com/a/459704/543411
    printf("Run %d taken %d cycles. %d ms %d us\n",run, diff, usec/1000, usec%1000);
  }
  printf("Sum: %g\n", sum);
  return 0;
}
Sebastian

I'm using Numpy 1.9.0.def-ff7d5f9, and I see a 10x difference when executing the two test lines you posted. I wouldn't be surprised if your machine and the compiler used to build Numpy are as important to the speedup as the Numpy version is.

In practice though, I don't think it's too common to want to do a reduction of a single column or row like this. I think a better test would be to compare reducing across all rows

a.sum(axis=0)

with reducing across all columns

a.sum(axis=1)

For me, these two operations have only a small difference in speed (reducing across columns takes about 95% of the time of reducing across rows).
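For example (a rough sketch; the timings are of course machine-dependent):

import numpy as np
import timeit

setup = 'import numpy as np; a = np.random.randn(4000, 4000)'

# reduce across all rows (axis=0) vs. across all columns (axis=1)
for stmt in ('a.sum(axis=0)', 'a.sum(axis=1)'):
    t = min(timeit.repeat(stmt, setup, repeat=3, number=100)) / 100
    print '%s: %.2f ms per call' % (stmt, t * 1e3)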

EDIT: In general, I'm very wary about comparing the speed of operations that take on the order of microseconds. When installing Numpy, it's very important to have a good BLAS library linked with it, since this is what does the heavy lifting when it comes to most large matrix operations (such as matrix-matrix multiplies). When comparing BLAS libraries, you definitely want to use intensive operations like matrix-matrix dot products as the point of comparison, since this is where you'll spend the vast majority of your time. I've found that sometimes, a worse BLAS library will actually have a slightly faster vector-vector dot operation than a better BLAS library. What makes it worse, though, is that operations like matrix-matrix dot products and eigenvalue decompositions take tens of times longer, and these matter much more than a cheap vector-vector dot. I think these differences often appear because you can write a reasonably fast vector-vector dot in C without much thought, but writing a good matrix-matrix dot takes a lot of thought and optimization, and is the more costly operation, so this is where good BLAS packages put their effort.
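For instance, a matrix-matrix product is the kind of benchmark where the linked BLAS actually matters (a sketch for illustration; the 1000x1000 size is arbitrary):

import numpy as np
import timeit

np.show_config()  # prints which BLAS/LAPACK numpy was built against

setup = ('import numpy as np; '
         'a = np.random.randn(1000, 1000); b = np.random.randn(1000, 1000)')
t = min(timeit.repeat('np.dot(a, b)', setup, repeat=3, number=10)) / 10
print '1000x1000 matrix-matrix dot: %.1f ms' % (t * 1e3)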

The same is true in Numpy: any optimization is going to be done on larger operations, not small ones, so don't get hung up on speed differences between small operations. Furthermore, it's hard to tell if any speed difference with a small operation is really due to the time of the computation, or is just due to overhead that is put in to optimize more costly operations.

hunse
  • Do you know how I find out which BLAS I'm using? I don't remember if I changed the default. – Sebastian Jul 14 '14 at 15:32
  • Sebastian, in this example BLAS is not used at all. Hunse's advice of use BLAS whenever you can is good, but to no effect for this case. – Francesc Jul 14 '14 at 15:43
  • @sebastian: Ahh, yes, sorry. I meant to mention that these reductions do not use BLAS. I just brought that up because when trying to set up my BLAS, I did a lot of benchmarking, and realized that measuring the time of short operations, such as your single-row reductions, is not indicative of real-world performance. That said, to check your BLAS use `numpy.show_config()`. – hunse Jul 14 '14 at 15:59