
TLDR:

numpy.cos() takes about 30% longer on certain arguments (exactly 24000.0, for example). Adding a small delta (+0.01) makes numpy.cos() run at its usual speed.

I have no idea why.


I stumbled across a strange problem while working with numpy. I was checking cache behaviour and accidentally produced the wrong graph - how the execution time of numpy.cos(X) depends on X. Here is my modified code (copied from my Jupyter notebook):

import numpy as np
import timeit

st = 'import numpy as np'
cmp = []
left = 0
right = 50000
step = 1000
# Outer loop for additional averaging/smoothing
for _ in range(10):
    cmp_list = []
    # Time np.cos for each argument value
    for i in range(left, right, step):
        s = timeit.timeit('np.cos({})'.format(i), number=15000, setup=st)
        # Truncation to 3 decimal places is only a hack to print nicer numbers later
        cmp_list.append(int(s * 1000) / 1000)
    cmp.append(cmp_list)

# Average the time for each argument across the 10 runs
av = [np.average([cmp[i][j] for i in range(len(cmp))]) for j in range(len(cmp[0]))]

# Draw the graph
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
plt.plot(range(left, right, step), av, marker='.')
plt.show()

The graph looked like this:

[plot: average np.cos time vs argument, 0 to 50000 in steps of 1000]

At first I thought it was just a random glitch. I re-ran my cells, but the result was nearly the same. So I started playing with the step parameter, the number of calculations, and the length of the averaging list. None of it had any effect on this number:

[plot: the same measurement with different step and averaging parameters]

And even closer:

[plot: zoomed-in view around the anomalous argument]

After that, range was useless (it can't step by floats), so I timed np.cos manually:

print(timeit.timeit('np.cos({})'.format(24000.01), number=5000000, setup=st))
print(timeit.timeit('np.cos({})'.format(24000.00), number=5000000, setup=st))
print(timeit.timeit('np.cos({})'.format(23999.99), number=5000000, setup=st))

And the result was:

3.4297256958670914
4.337243619374931
3.4064380447380245

np.cos() takes about 30% longer for exactly 24000.00 than for 24000.01!
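For comparison, one can also time the vectorized call on a whole array of identical values (a quick sketch that was not part of my original notebook; the array length and repeat count are arbitrary, and a comment below reports that the difference may not show up in this form):

import timeit

setup_tpl = 'import numpy as np; a = np.full(100000, {})'
for value in (24000.00, 24000.01):
    t = timeit.timeit('np.cos(a)', number=100, setup=setup_tpl.format(value))
    print(value, t)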

There were other strange numbers like it (somewhere around 500000, I don't remember exactly).
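A brute-force way to hunt for more of these slow arguments is to time a range of values and flag the outliers (a rough sketch, also not from my original notebook; the range, repeat count and 1.5x threshold are arbitrary):

import timeit

setup = 'import numpy as np'
# Reference timing for an "ordinary" argument
baseline = timeit.timeit('np.cos(1.0)', number=20000, setup=setup)
# Print arguments that are noticeably slower than the baseline
for i in range(500):
    t = timeit.timeit('np.cos({})'.format(i), number=20000, setup=setup)
    if t > 1.5 * baseline:
        print(i, t)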

I looked through the numpy documentation and its source code, and found nothing about this effect. I know that trigonometric functions can use several algorithms depending on the argument's magnitude, required precision, etc., but it is confusing to me that specific exact numbers take noticeably longer to compute.

Why does np.cos() have this strange behaviour? Is it some kind of processor side effect (numpy.cos uses C functions whose behaviour depends on the processor)? I have an Intel Core i5 and Ubuntu installed, if that helps anyone.
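As suggested in the comments, the underlying cos implementation comes from whatever math library numpy was built against, so it makes sense to record the environment together with the timings (a small sketch; the output will obviously differ between machines):

import platform
import numpy as np

print(platform.processor())   # CPU identification string
print(platform.platform())    # OS / kernel version
print(np.__version__)
np.show_config()              # build / backend configuration of this numpy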


Edit 1: I tried to reproduce it on another machine with an AMD Ryzen 5. The results are simply unstable. Here are the graphs for three sequential runs of the same code:

import numpy as np
import timeit

s = 'import numpy as np'
times = []
x_ranges = np.arange(23999, 24001, 0.01)
for x in x_ranges:
    times.append(timeit.timeit('np.cos({})'.format(x), number=100000, setup=s))

# ---------------

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x_ranges, times)
plt.show()

[plots: three sequential runs of np.cos timing over np.arange(23999, 24001, 0.01) on the Ryzen machine, each with a different pattern]

Well, there are some patterns (like a mostly consistent left part and an inconsistent right part), but they differ a lot from the runs on the Intel processor. It looks like this really is a processor-specific effect, and AMD behaviour is much more predictable in its indeterminism :)

P.S. @WarrenWeckesser thanks for the `np.arange` function. It is really useful, but it changes nothing in the results, as expected.

  • You might want to take a look at https://stackoverflow.com/questions/2284860/how-does-c-compute-sin-and-other-math-functions – user2699 Apr 05 '19 at 14:13
  • It is exactly what I wrote in the last paragraph. I looked at those sources: either they are pretty simple and show no reason (to me) for this effect, or they are very complicated and I can't properly understand how they work. Anyway, I am not a hardcore C programmer, so I may have missed something in those sources. – vurmux Apr 05 '19 at 14:20
  • No, it's related to numerical accuracy rather than a processor effect. The most relevant part of the linked question is probably "the library implements many different algorithms and its first job is to look at x and decide which algorithm to use... Several of the algorithms first compute a quick result, then if that's not accurate enough, discard it and fall back on a slower algorithm." – user2699 Apr 05 '19 at 14:32
  • Although, if you have source code, can you include it or link to it in the question? – user2699 Apr 05 '19 at 14:33
  • [These lines](https://github.com/numpy/numpy/blob/v1.16.2/numpy/core/src/npymath/npy_math_internal.h.src#L455-L471) are where the function is defined. It's a bit hard to read, but it essentially says that the NumPy C function (`npy_cos`) will call the corresponding function from the standard library (`cos`). I haven't been able to reproduce your results though, `np.cos` on `np.full(100000, 24000.0)` and `np.full(100000, 24000.01)` take about the same time for me. I also tried with an independent C program (with MSVC) and the results were similar. – jdehesa Apr 05 '19 at 14:36
  • @user2699 I don't remember exactly where I saw those sources. I googled for them about a week ago and they were not clear to me, so I dropped the idea :( – vurmux Apr 05 '19 at 14:41
  • Kind of hard to answer a question about them then. Anyhow, if you want to skip reading through c source code (and who doesn't?) a 10-20% difference in speed isn't significant enough to be worth worrying about. – user2699 Apr 05 '19 at 14:47
  • But yes, you're right that numpy calls the c functions so it's a result of how those functions are defined. – user2699 Apr 05 '19 at 15:01
  • Add the output of `numpy.show_config()` to your question. If you are using the Anaconda distribution of Python, you probably have Intel's MKL library as a backend, and that includes the underlying implementation of the `cos` function. – Warren Weckesser Apr 05 '19 at 15:01
  • I use plain Python 3 and numpy, without Anaconda. Sorry, but I am away from the machine I tested on, so I can't do it now. The speed is not significant for a couple of numbers, of course; I was just interested in this behaviour. – vurmux Apr 05 '19 at 15:34
  • Why do you use `int(s*1000)/1000`? You are modifying the timing by truncating it at the thousandth place, which loses accuracy for no reason. Try getting rid of that, and just use `cmp_list.append(s)`. – Warren Weckesser Apr 05 '19 at 15:58
  • Before you put too much significance in the apparently anomalous timing at 24000, try making the fix I suggested above, and use a finer mesh. Instead of `range(left, right, step)`, you could use `np.arange(left, right, step)`, which allows `step` to be a fraction less than 1. Or you could use `np.linspace(left, right, numpoints)`, where `numpoints` is the number of points to generate between `left` and `right`. I suspect you'll find *many* numbers where the time is similar to that of 24000. – Warren Weckesser Apr 05 '19 at 16:25
  • I'm curious about this now too. On a couple of ubuntu boxes with different CPUs I see the same behavior as the OP in `numpy.cos` and `math.cos`, but not in a similar c program; on my mac, there's no significant difference in timing for 24000.0 vs 24000.01 in python or c. – Nathan Vērzemnieks Apr 05 '19 at 16:45
  • Also, there are other numbers like this: 108, 448, 495, 3064, and 4618 all consistently take more than twice as long calculating cos (again, in `math.cos` in python) as their neighboring ints (or neighboring numbers +/- .01). – Nathan Vērzemnieks Apr 05 '19 at 16:47
  • (@WarrenWeckesser - fwiw I am doing no truncation of the timing, and timing 1m runs of the function for each value. The results are very consistent.) – Nathan Vērzemnieks Apr 05 '19 at 16:54
  • Yes, `int(s*1000)/1000` was just a quick temporary hack to print nicer numbers in long lists :) It had no effect on the result because the timings already differ from the second digit. – vurmux Apr 07 '19 at 23:20
  • Is the second (Ryzen system) also running on Linux? The timings on Windows will very likely look different, and the results for some values also. https://stackoverflow.com/q/55341055/4045774 – max9111 Apr 08 '19 at 12:17
  • Linux too. And with pure Python, without Anaconda. – vurmux Apr 08 '19 at 13:00

1 Answer


The slowness in computing the result for those special numbers may be related to exact rounding and the table maker's dilemma.

To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500...0ddd or 5.0834999...9ddd.
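One can reproduce the spirit of this example with mpmath by printing the value with more and more digits and watching how long it takes for the digits to settle on one side of the rounding boundary (a quick sketch, not part of the original answer; the chosen precisions are arbitrary):

from mpmath import mp

# Print exp(1.626) at increasing decimal precision; the question is how
# many digits are needed before we can tell which way to round at 4 places.
for digits in (5, 6, 7, 8, 20):
    mp.dps = digits
    print(digits, mp.exp(mp.mpf('1.626')))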

Although for this very reason the IEEE standard does NOT require transcendental functions to be exactly rounded, it is possible that the implementation of the math.cos function is hit by this problem while doing its best to compute the most accurate result, before figuring out that the extra effort is not worth it.

To demonstrate that this is the case for some number X, one has to compute the value of math.cos(X) with high precision and check its binary representation - the representable part of the mantissa must be followed by one of the following patterns:

  • 1 and a long run of 0's
  • 0 and a long run of 1's (when the value is computed with precision lower than required to accommodate all the 1's in the run, this case appears as the first one)

Therefore, the probability that a number will be a slow argument to a transcendental function is 1/2^n, where n is the maximal length of the above pattern seen by the algorithm, after which it gives up trying to arrive at the exactly rounded result.


Demonstration highlighting the representable part of the mantissa for the IEEE 754 double precision case (where the mantissa has 53 bits):

In [1]: from mpmath import mp

In [2]: import math

In [3]: def show_mantissa_bits(x, n, k):
   ...:     print(bin(int(mp.floor(abs(x) * 2**n)))[2:])
   ...:     print('^'*k)
   ...:     

In [4]: mp.prec = 100

In [5]: show_mantissa_bits(mp.cos(108), 64, 53)
110000000100001011001011010000110111110010100011000011000000000
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [6]: show_mantissa_bits(mp.cos(108.01), 64, 53)
101110111000000110001101110001000010100111000010101100000100110
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [7]: show_mantissa_bits(mp.cos(448), 64, 53)
101000101000100111000010111100001011111000001111110001000000000
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [8]: show_mantissa_bits(mp.cos(448.01), 64, 53)
101001110110001010010100100000110001111100000001101110111010111
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [9]: show_mantissa_bits(mp.cos(495), 64, 53)
11001010100101110110001100110101010011110010000000000011111111
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [10]: show_mantissa_bits(mp.cos(495.01), 64, 53)
11010100100111100110000000011000110000001001101100010000001010
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [11]: show_mantissa_bits(mp.cos(24000), 64, 53)
11001000100000001100110111011101001101101101000000110011111111
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [12]: show_mantissa_bits(mp.cos(24000.01), 64, 53)
10111110011100111001010101100101110001011010101011001010110011
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
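As a follow-up, one can also count how long the run of identical bits after the 53-bit boundary is for a given argument, which quantifies how close cos(x) sits to a rounding boundary (a rough sketch along the same lines as the demonstration above, not part of the original answer):

from mpmath import mp

mp.prec = 100

def tail_run(x, n=64, k=53):
    # Same scaling as show_mantissa_bits: the leading bits of |cos(x)|
    bits = bin(int(mp.floor(abs(mp.cos(x)) * 2**n)))[2:]
    tail = bits[k:]               # bits beyond the 53 representable ones
    if len(tail) < 2:
        return tail, 0
    # Length of the run of identical bits following the first tail bit
    run = len(tail[1:]) - len(tail[1:].lstrip(tail[1]))
    return tail, run

for x in (108, 108.01, 448, 24000, 24000.01):
    print(x, *tail_run(x))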
  • Sounds interesting, I hadn't thought about this effect. In double precision there really are six 1's after the 53rd bit up to the end of the mantissa. It looks like that is the key to this effect. But can you answer why AMD processors calculate it so unpredictably? – vurmux Apr 08 '19 at 10:12
  • Anyway, it is the best answer one could write, very clear and useful :) – vurmux Apr 08 '19 at 11:19
  • @vurmux I have no idea regarding the instability of performance on AMD Ryzen processors. – Leon Apr 08 '19 at 11:40
  • @vurmux: I'd guess that performance "instability" might depend on branch prediction. numpy is unlikely to use a legacy x87 `fcos` instruction, but instead probably does a bunch of "basic" FP math operations, and/or extended-precision software calculation or picking apart the FP binary representation. Some of this may involve branching, and you're then at the mercy of branch prediction. – Peter Cordes Jul 06 '19 at 01:56
  • Modern TAGE / ITTAGE branch predictors don't really have a single entry per branch, they just index into a table based on past branch history as well as the branch address. It works well in practice, but is sensitive to surrounding code (including in the benchmark test harness). Obviously different FP inputs can lead to different branching inside the function; that's the point of branching. – Peter Cordes Jul 06 '19 at 01:56