
Suppose we take np.dot of two 'float32' 2D arrays:

res = np.dot(a, b)   # see CASE 1
print(list(res[0]))  # list shows more digits
[-0.90448684, -1.1708503, 0.907136, 3.5594249, 1.1374011, -1.3826287]

Numbers. Except, they can change:


CASE 1: slice a

import numpy as np

np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(6, 6).astype('float32')

for i in range(1, len(a)):
    print(list(np.dot(a[:i], b)[0])) # full shape: (i, 6)
[-0.9044868,  -1.1708502, 0.90713596, 3.5594249, 1.1374012, -1.3826287]
[-0.90448684, -1.1708503, 0.9071359,  3.5594249, 1.1374011, -1.3826288]
[-0.90448684, -1.1708503, 0.9071359,  3.5594249, 1.1374011, -1.3826288]
[-0.90448684, -1.1708503, 0.907136,   3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136,   3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136,   3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136,   3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136,   3.5594249, 1.1374011, -1.3826287]

Results differ, even though the printed row is computed from exactly the same numbers each time.
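
As a sanity check that slicing isn't the culprit - the slices are views of the same memory, yet the first row of the product can still differ with slice length:

print(np.shares_memory(a, a[:3]))  # True: slicing yields a view, not a copy
print(np.array_equal(np.dot(a[:1], b), np.dot(a[:3], b)[:1]))  # False on affected setups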


CASE 2: flatten a, take a 1D version of b, then slice a:
np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(1, 6).astype('float32')

for i in range(1, len(a)):
    a_flat = np.expand_dims(a[:i].flatten(), -1) # keep 2D
    print(list(np.dot(a_flat, b)[0])) # full shape: (i*6, 6)
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]

CASE 3: stronger control; set all non-involved entries to zero by adding a[1:] = 0 to the CASE 1 code (sketch below). Result: discrepancies persist.
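
For completeness, the CASE 3 code is just CASE 1 with the extra zeroing line:

np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(6, 6).astype('float32')
a[1:] = 0  # zero every row of a except row 0, whose product is printed

for i in range(1, len(a)):
    print(list(np.dot(a[:i], b)[0]))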


CASE 4: check indices other than [0]; as with [0], results begin to stabilize a fixed number of array enlargements away from their point of creation. Output

np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(6, 6).astype('float32')

for j in range(len(a) - 2):
    for i in range(1, len(a)):
        res = np.dot(a[:i], b)
        try:
            print(list(res[j]))
        except IndexError:  # res has fewer than j+1 rows for small i
            pass
    print()

Hence, for the 2D * 2D case, results differ - but they are consistent for 1D * 1D. From some of my reading, this appears to stem from 1D-1D using simple sequential addition, whereas 2D-2D uses 'fancier', performance-boosting addition that may be less precise (pairwise addition, by contrast, actually improves precision). Nonetheless, I'm unable to understand why the discrepancies in CASE 1 vanish once a is sliced past a set 'threshold'; the larger a and b, the later this threshold seems to lie, but it always exists.
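
For illustration only (this is not the kernel np.dot uses, just a demonstration that accumulation order alone moves the last digits in float32): a strict left-to-right loop vs. np.sum, which uses pairwise (tree-like) summation for contiguous float arrays:

np.random.seed(1)
x = np.random.randn(10000).astype('float32')

sequential = np.float32(0)
for v in x:
    sequential += v        # strict left-to-right accumulation

pairwise = np.sum(x)       # pairwise accumulation

print(sequential, pairwise, x.astype('float64').sum())  # last value: float64 reference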

All said: why is np.dot imprecise (and inconsistent) for ND-ND arrays? Relevant Git


Additional info:

  • Environment: Win-10 OS, Python 3.7.4, Spyder 3.3.6 IDE, Anaconda 3.0 2019/10
  • CPU: i7-7700HQ 2.8 GHz
  • Numpy v1.16.5

Possible culprit library: NumPy MKL - also BLAS libraries; thanks to Bi Rico for noting


Stress-test code: as noted, discrepancies become more frequent with larger arrays; if the above isn't reproducible, the below should be (if not, try larger dims). My output

np.random.seed(1)
a = (0.01*np.random.randn(9, 9999)).astype('float32') # first multiply then type-cast
b = (0.01*np.random.randn(9999, 6)).astype('float32') # *0.01 to bound mults to < 1

for i in range(1, len(a)):
    print(list(np.dot(a[:i], b)[0]))

Problem severity: the discrepancies shown are 'small', but no longer so when operating on a neural network with billions of numbers multiplied over a few seconds, and trillions over the entire runtime; reported model accuracy differs by whole tens of percent, per this thread.

Below is a gif of the arrays that result from feeding a model what is basically a[0], with len(a)==1 vs. len(a)==32:


Results on OTHER PLATFORMS, with thanks to Paul for testing:

Case 1 reproduced (partly):

  • Google Colab VM -- Intel Xeon 2.3 GHz -- Jupyter -- Python 3.6.8
  • Win-10 Pro Docker Desktop -- Intel i7-8700K -- jupyter/scipy-notebook -- Python 3.7.3
  • Ubuntu 18.04.2 LTS + Docker -- AMD FX-8150 -- jupyter/scipy-notebook -- Python 3.7.3

Note: these yield much lower error than shown above; two entries in the first row are off by one in the least significant digit from the corresponding entries in the other rows.

Case 1 not reproduced:

  • Ubuntu 18.04.3 LTS -- Intel i7-8700K -- IPython 5.5.0 -- Python 2.7.15+ and 3.6.8 (2 tests)
  • Ubuntu 18.04.3 LTS -- Intel i5-3320M -- IPython 5.5.0 -- Python 2.7.15+
  • Ubuntu 18.04.2 LTS -- AMD FX-8150 -- IPython 5.5.0 -- Python 2.7.15rc1

Notes:

  • The linked Colab notebook and Jupyter environments show a far smaller discrepancy (and only for the first two rows) than is observed on my system. Also, Case 2 has never (yet) shown imprecision.
  • Within this very limited sample, the current (Dockerized) Jupyter environment is more susceptible than the IPython environment.
  • np.show_config() output is too long to post, but in summary: the IPython envs are BLAS/LAPACK-based; Colab is OpenBLAS-based. In the IPython Linux envs, the BLAS libraries are system-installed -- in Jupyter and Colab, they come from /opt/conda/lib

UPDATE: the accepted answer is accurate, but broad and incomplete. The question remains open for anyone who can explain the behavior at the code level - namely, the exact algorithm used by np.dot, and how it explains the 'consistent inconsistencies' observed in the above results (also see comments). Here are some direct implementations beyond my deciphering: sdot.c -- arraytypes.c.src
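
For anyone digging in, here's one quick comparison (a sketch, assuming a BLAS-backed NumPy): a single 2D-2D np.dot, which goes through the BLAS matrix-matrix routine, vs. the same result assembled from 1D-1D dots, which go through the vector-dot path:

import numpy as np

np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(6, 6).astype('float32')

gemm_res = np.dot(a, b)                      # 2D x 2D: matrix-matrix BLAS routine
sdot_res = np.array([[np.dot(a[i], b[:, j])  # 1D x 1D: vector-dot path
                      for j in range(b.shape[1])]
                     for i in range(a.shape[0])], dtype='float32')

print(np.abs(gemm_res - sdot_res).max())     # nonzero on setups that reproduce CASE 1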

OverLordGoldDragon
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/202185/discussion-on-question-by-overlordgolddragon-why-is-np-dot-imprecise-n-dim-arr). – Samuel Liew Nov 12 '19 at 06:29
  • The general algorithms for `ndarrays` usually disregard numeric precision loss. Because for simplicity they `reduce-sum` along each axis, the order of the operations might not be the optimal one... Note that if you mind precision error you might as well use `float64` – Vitor SRG Nov 15 '19 at 20:13
  • I may not have time to review tomorrow, so awarding the bounty now. – Paul Nov 16 '19 at 22:41
  • @Paul It would be awarded automatically to the highest-voted answer anyway - but alright, thanks for notifying – OverLordGoldDragon Nov 16 '19 at 22:44

1 Answer


This looks like unavoidable numeric imprecision. As explained here, NumPy uses a highly optimized, carefully tuned BLAS method for matrix multiplication. This means that the sequence of operations (sums and products) used to multiply two matrices probably changes when the size of the matrices changes.

To be clearer: we know that, mathematically, each element of the resulting matrix can be calculated as the dot product of two vectors (equal-length sequences of numbers). But this is not how NumPy calculates an element of the resulting matrix. In fact, there are more efficient but more complex algorithms, like the Strassen algorithm, that obtain the same result without directly computing the row-column dot product.

When using such algorithms, even though the element C_ij of a resulting matrix C = AB is mathematically defined as the dot product of the i-th row of A with the j-th column of B, if you multiply a matrix A2 having the same i-th row as A by a matrix B2 having the same j-th column as B, the element C2_ij will actually be computed following a different sequence of operations (one that depends on the whole A2 and B2 matrices), possibly leading to different numerical errors.

That's why, even if mathematically C_ij = C2_ij (as in your CASE 1), the different sequence of operations followed by the algorithm (due to the change in matrix size) leads to different numerical errors. This also explains the slightly different results across environments, and the fact that in some cases, for some environments, the numerical error might be absent (actually self-compensated).
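
As a toy illustration of that last point (this is not the actual BLAS algorithm, only a demonstration that operation order matters in float32): the same row-column dot product accumulated in two different orders usually rounds differently:

import numpy as np

np.random.seed(0)
x = np.random.randn(10000).astype('float32')
y = np.random.randn(10000).astype('float32')

fwd = np.float32(0)
for xi, yi in zip(x, y):
    fwd += xi * yi              # accumulate left-to-right

rev = np.float32(0)
for xi, yi in zip(x[::-1], y[::-1]):
    rev += xi * yi              # same terms, accumulated right-to-left

print(fwd, rev, fwd == rev)     # typically differ in the last digits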

mmj
  • Thanks for the link, it appears to contain relevant info - your answer, however, could be more detailed, as so far it's a paraphrasing of comments below the question. For example, the linked SO shows direct `C` code, and provides algorithm-level explanations, so it's heading in the right direction. – OverLordGoldDragon Nov 12 '19 at 02:14
  • It also isn't "unavoidable", as shown at the bottom of the question - and extent of imprecision varies across environments, which remains unexplained – OverLordGoldDragon Nov 12 '19 at 02:30
  • @overlordgolddragon "Unavoidable" means that at least some environments will experience the numeric imprecision in some cases. When an environment is not affected by the imprecision for a certain case, it means that the matrix product algorithm in that case follows the same path for the various matrix products you are comparing. – mmj Nov 12 '19 at 09:18
  • @OverLordGoldDragon I edited the answer in order to be clearer. – mmj Nov 12 '19 at 10:02
  • It is easy to see if this is floating-point imprecision by testing it with `float64` instead. – 9000 Nov 12 '19 at 17:28
  • Better - however, what's desired are the actual algorithms used by `np.dot`, rather than "it might be something like this". Former would help debug `np.dot`-related problems, e.g. whether multithreading contributes to error, and whether there are slightly slower but more precise library alternatives one could configure. Also your **C** vs. **C2** example isn't very clear - how are **A2** and **B2** different from **A** and **B**, and what particular step(s) in culprit algorithms do you suspect to yield observed 'consistent inconsistencies'? – OverLordGoldDragon Nov 12 '19 at 17:29
  • @9000 That doesn't tell much, and misses the point of the question - also the inconsistency behavior persists even for `float64`, but the significant figures' order is kicked back – OverLordGoldDragon Nov 12 '19 at 17:31
  • @mmj If you'd like a sample of what I'm looking for here, see [this answer](https://stackoverflow.com/questions/58441514/why-is-tensorflow-2-much-slower-than-tensorflow-1/58653636#58653636); it doesn't need to be just like it, but at least 'half' like it - _specifics_. To link again, `C` implementations for reference: [sdot.c](https://www.netlib.org/clapack/cblas/sdot.c); [arraytypes.c](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src) – OverLordGoldDragon Nov 12 '19 at 17:33
  • I suppose that the order of operations differs as we choose different `i` in `a[:i]`. Mathematically the result should be the same, but in floating point, the order of additions can affect the last few bits because rounding errors accumulate differently. Also, I'm not sure that a long list of floating-point operations would be executed by a modern CPU in the order given, especially with other OS processes running; I suspect parts of the sum could be computed and thus added in a different order, because the CPU tries to optimally load its FP units. – 9000 Nov 12 '19 at 17:45
  • @9000 Does this apply w/ singlethreading as well? Also, do you have an example of addition order altering numeric precision? I'm familiar with division cases, but not addition. Lastly, what's "FP units"? – OverLordGoldDragon Nov 12 '19 at 21:19
  • @OverLordGoldDragon: (1) A trivial example with addition: take number `n`, take number `k` such that it's below the precision of `n`'s last mantissa digit. For Python's native floats, `n = 1.0` and `k = 1e-16` works. Now let `ks = [k] * 100`. See that `sum([n] + ks) == n`, while `sum(ks + [n]) > n`, that is, summation order mattered. (2) Modern CPUs have several units to execute floating-point (FP) operations in parallel, and the order in which `a + b + c + d` is computed on a CPU is not defined, even if the command `a + b` comes before `c + d` in machine code. – 9000 Nov 12 '19 at 22:04
  • @OverLordGoldDragon You should be aware that most of the numbers that you ask your program to deal with cannot be represented exactly by a floating point. Try `format(0.01, '.30f')`. If even a simple number like `0.01` cannot be represented exactly by a NumPy floating point, there is no need to know the deep details of NumPy's matrix multiplication algorithm to understand the point of my answer; that is, **different starting matrices lead to different sequences of operations**, so that mathematically equal results may differ by a small amount due to numerical errors. – mmj Nov 12 '19 at 22:36
  • @9000 What is this black magic - have any reference material I can read on why this occurs? Thanks for the example – OverLordGoldDragon Nov 12 '19 at 23:34
  • @mmj Ending this question on "float imprecision" is like ending my TF2 question on "Eager is slower"; there's simply more to it. Yes, I accept the _broad_ idea that different matrix sizes utilize different operations on different hardware for optimization - but it remains too broad; what is _one_ exact algorithm which could yield the observed "consistent inconsistencies"? How do we explain why inconsistencies vanish after `len(a)` hits a certain threshold? Why does this threshold increase if inner dimensions of `a` and `b` are larger? Why are 'newest' rows of `a` always subject to imprecision? – OverLordGoldDragon Nov 12 '19 at 23:38
  • There is a clear _structure_ to these imprecisions - they are _systematic_, not random, as latter would be evident in repeated runs w/ same seed. All these questions remain so far unanswered by your post - hence, I'll be keeping the question open – OverLordGoldDragon Nov 12 '19 at 23:40
  • @OverLordGoldDragon re: black magic. There's a paper that's required reading at a couple of CS MOOCs. My recall ain't that great, but I think this is it: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf – Paul Nov 13 '19 at 04:48
  • @Paul is this [this](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)? Guess I shouldn't be putting it off, then - thanks for linking – OverLordGoldDragon Nov 13 '19 at 04:52
  • probably 7654321 – Paul Nov 13 '19 at 04:53
  • @Paul Fine source, though no direct discussion of sum order imprecision; I found [this source](https://docs.python.org/3.7/tutorial/floatingpoint.html) more helpful to that end. I now understand the sum order argument, but it still doesn't suffice for explaining the consistent inconsistencies - and a broad "it could be algos X, Y, or Z" could do a lot better. Maybe I'll dig into it myself before your bounty expires. Also, here's a [better demo](https://pastebin.com/rGYbndfX) – OverLordGoldDragon Nov 16 '19 at 00:04
  • @mmj As even numpy devs themselves didn't come up with much, I doubt anyone else will either - thus, I'll let you have the cake. – OverLordGoldDragon Nov 20 '19 at 10:12