I've tested an example demonstrated in this talk [pytables] using numpy (page 20/57).
It is stated that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.
I tried to reproduce this, but failed to do so. Am I measuring incorrectly, or have things changed in NumPy since 2010?
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()'
1000 loops, best of 3: 16.5 usec per loop
$ python2 -m timeit -n1000 --setup \
'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()'
1000 loops, best of 3: 13.8 usec per loop
$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1
While I can measure a benefit for the second version (supposedly fewer cache misses, because numpy stores arrays in C-style row-major order), I don't see the drastic difference stated by the pytables contributor.
Also, it seems I cannot observe more cache misses for column summation than for row summation.
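As a quick sanity check on the memory-layout explanation (my own snippet, not from the talk), the strides show that a row slice walks contiguous memory while a column slice jumps a full row per element:
import numpy as np
a = np.random.randn(4000, 4000)   # C-contiguous float64 array
print(a.flags['C_CONTIGUOUS'])    # True: rows are laid out contiguously
print(a[1, :].strides)            # (8,): adjacent elements are 8 bytes apart
print(a[:, 1].strides)            # (32000,): each element is one full row apart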
EDIT
So far the insight for me was that I was using the timeit module in the wrong way: repeated runs with the same array (or row/column of an array) will almost certainly be served from cache. Note that np.random.randn produces float64 values, so a 4000-element row is 4000 * 8 byte ≈ 31 KiB, which still fits into my 32 KiB of L1 data cache.
Using the script in the answer of @alim with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random array (n x n), I am measuring:
n      row/us   col/us   col penalty
1k     90       100      1
4k     100      210      2
10k*   110      350      3.5
20k*   120      1200     10
* n=10k and higher doesn't fit into the L1d cache anymore.
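For reference, here is a minimal sketch of how such a measurement with nloop=1 and nrep=10 can be run (my own reconstruction of the approach, not @alim's actual script). timeit re-runs the setup for every trial, so each trial starts from a freshly allocated array rather than a cache-warm one:
import timeit

n = 4000  # matrix is n x n; raise n to push a single row out of the L1d cache
setup = ("import numpy as np; "
         "a = np.random.randn(%d, %d)" % (n, n))

# number=1 gives a single loop per trial; repeat=10 gives ten trials,
# each preceded by a fresh run of the setup code.
row = timeit.repeat('a[1, :].sum()', setup=setup, number=1, repeat=10)
col = timeit.repeat('a[:, 1].sum()', setup=setup, number=1, repeat=10)
print('row sum trials (us): %s' % ['%.0f' % (t * 1e6) for t in row])
print('col sum trials (us): %s' % ['%.0f' % (t * 1e6) for t in col])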
I'm still not sure how to track down the cause of this, as perf shows about the same cache-miss rate (sometimes even a higher one) for the faster row sum.
Perf data: nloop=2 and nrep=2, so I expect some of the data to still be in the cache for the second run.
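Since answer1.py itself isn't reproduced here, the following is a rough stand-in for what runs under perf below (a sketch with the nloop/nrep values above, not @alim's actual code):
#!/usr/bin/env python
# Hypothetical stand-in for answer1.py: time one slice-sum repeatedly.
import time
import numpy as np

n = 10000
nloop, nrep = 2, 2
a = np.random.randn(n, n)

best = float('inf')
for _ in range(nrep):
    t0 = time.time()
    for _ in range(nloop):
        a[1, :].sum()          # use a[:, 1].sum() for the column-sum run
    best = min(best, (time.time() - t0) / nloop)
print('row sum: %.3f us' % (best * 1e6))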
Row sum n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
row sum: 103.593 us
Performance counter stats for './answer1.py':
25850670 cache-references [30.04%]
1321945 cache-misses # 5.114 % of all cache refs [20.04%]
5706371393 L1-dcache-loads [20.00%]
11733777 L1-dcache-load-misses # 0.21% of all L1-dcache hits [19.97%]
2401264190 L1-dcache-stores [20.04%]
131964213 L1-dcache-store-misses [20.03%]
2007640 L1-dcache-prefetches [20.04%]
21894150686 cycles [20.02%]
24582770606 instructions # 1.12 insns per cycle [30.06%]
3534308182 branches [30.01%]
3767 faults
6 migrations
7.331092823 seconds time elapsed
Column sum n=10k
perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/ /g'
column sum: 377.059 us
Performance counter stats for './answer1.py':
26673628 cache-references [30.02%]
1409989 cache-misses # 5.286 % of all cache refs [20.07%]
5676222625 L1-dcache-loads [20.06%]
11050999 L1-dcache-load-misses # 0.19% of all L1-dcache hits [19.99%]
2405281776 L1-dcache-stores [20.01%]
126425747 L1-dcache-store-misses [20.02%]
2128076 L1-dcache-prefetches [20.04%]
21876671763 cycles [20.00%]
24607897857 instructions # 1.12 insns per cycle [30.00%]
3536753654 branches [29.98%]
3763 faults
9 migrations
7.327833360 seconds time elapsed
EDIT2
I think I have understood some aspects, but the question has not been fully answered yet. At the moment I believe this summation example doesn't reveal anything about CPU caches at all. To eliminate the uncertainty introduced by numpy/Python, I tried using perf on a C implementation of the summation; the results are in an answer below.