1

I am trying to work with large 2D arrays in Python, but it's very slow. For example:

import time
import numpy

start = time.time()
result = numpy.empty([5000, 5000])

for i in range(5000):
    for j in range(5000):
        result[i, j] = (i * j) % 10

end = time.time()
print(end - start) # 8.8 s

The same program in Java is much faster:

long start = System.currentTimeMillis();
int[][] result = new int[5000][5000];

for (int i = 0; i < 5000; i++) {
    for (int j = 0; j < 5000; j++) {
        result[i][j] = (i * j) % 10;
    }
}

long end = System.currentTimeMillis();
System.out.println(end - start); // 121 ms

Is it because Python is an interpreted language? Is there any way to improve it? And why is Python so popular for working with matrices, artificial intelligence, etc.?

NPE
Michal
6 Answers

7

You aren't actually using the power of NumPy - you're performing your loops manually at Python level. This is roughly analogous to wondering why everyone uses cars if it takes so much longer to walk to the store when you're dragging a car behind you.

Use native NumPy operations to push your work into C-level loops. For example,

temp = numpy.arange(5000)
result = numpy.outer(temp, temp) % 10
# or result = temp * temp[:, None] % 10

This will go much faster.
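As a quick sanity check (using a small n, with variable names chosen here just for illustration), you can confirm the vectorized expression produces exactly the same values as the explicit loops:

```python
import numpy

n = 100  # small size, just for the check

# explicit Python-level loops, as in the question
looped = numpy.empty((n, n), dtype=numpy.int64)
for i in range(n):
    for j in range(n):
        looped[i, j] = (i * j) % 10

# vectorized version
temp = numpy.arange(n)
vectorized = numpy.outer(temp, temp) % 10

print(numpy.array_equal(looped, vectorized))  # True
```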

user2357112
  • 260,549
  • 28
  • 431
  • 505
7

Read to the end to see how NumPy can outperform your Java code by 5x.

NumPy's strength lies in vectorized computations. Your Python code relies on interpreted loops, and interpreted loops tend to be slow.

I rewrote your Python code as a vectorized computation and that immediately sped it up by a factor of ~16:

In [41]: v = np.arange(5000)

In [42]: %timeit np.outer(v, v) % 10
1 loop, best of 3: 544 ms per loop

Computing % 10 in place instead of creating a new array speeds things up by another 20%:

In [37]: def f(n):
    ...:     v = np.arange(n)
    ...:     a = np.outer(v, v)
    ...:     a %= 10
    ...:     return a
    ...:

In [39]: %timeit f(5000)
1 loop, best of 3: 437 ms per loop

edit 1: Doing the computations in 32 bits instead of 64 (to match your Java code) basically matches the performance of Java — h/t to @user2357112 for pointing this out:

In [50]: def f(n):
    ...:  v = np.arange(n, dtype=np.int32)
    ...:  a = np.outer(v, v)
    ...:  a %= 10
    ...:  return a
    ...:

In [51]: %timeit f(5000)
10 loops, best of 3: 126 ms per loop
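Halving the element width also halves memory traffic and cache pressure, which you can see directly from the array sizes (a small sketch, not part of the original timing runs):

```python
import numpy as np

# same 5000x5000 shape as the question, in 64-bit and 32-bit integers
a64 = np.zeros((5000, 5000), dtype=np.int64)
a32 = np.zeros((5000, 5000), dtype=np.int32)

print(a64.nbytes)  # 200000000 bytes (8 bytes per element)
print(a32.nbytes)  # 100000000 bytes (4 bytes per element)
```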

edit 2: And with a little bit of work we can make this code about 5x faster than your Java implementation (here ne refers to the numexpr module):

In [69]: v = np.arange(5000, dtype=np.int32)

In [70]: vt = v[np.newaxis].T

In [71]: %timeit ne.evaluate('v * vt % 10')
10 loops, best of 3: 25.3 ms per loop

edit 3: Please make sure to also take a look at the answer given by @max9111.

NPE
    Specify `dtype='int32'` and see what happens - unless you're on Windows, you're probably doing your computations in the equivalent of Java `long`. This may account for another chunk of the timing difference. – user2357112 Oct 12 '19 at 10:19
    @user2357112: You hit the nail on the head there — thank you! Let me update the answer. – NPE Oct 12 '19 at 10:22
  • @user2357112 Could you explain why int32 is faster than int64? – Van Aug 31 '23 at 10:44
    @Van: 4-byte numbers take less memory bandwidth and cache space. – user2357112 Aug 31 '23 at 11:01
  • @user2357112 That makes sense. Thank you! – Van Aug 31 '23 at 13:32
2

Is there any way to improve it?

See the time performance difference:

In [13]: arr = np.empty([5000, 5000])                                                                          

In [14]: %timeit np.multiply(*np.indices(arr.shape)) % 10                                                      
482 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

where np.indices returns the indices of a grid
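To illustrate what np.indices produces (a small sketch with a 3×3 shape):

```python
import numpy as np

rows, cols = np.indices((3, 3))

print(rows)  # row index at each grid position
# [[0 0 0]
#  [1 1 1]
#  [2 2 2]]
print(cols)  # column index at each grid position
# [[0 1 2]
#  [0 1 2]
#  [0 1 2]]
print((rows * cols) % 10)  # the same values the question's loops compute
```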


why Python is so popular for working with matrices, artificial intelligence, ...

NumPy routines are implemented in C (which remains one of the fastest languages, if not the fastest) and use densely packed arrays. Related topic: https://stackoverflow.com/a/8385658/3185459
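The "densely packed" part can be made concrete by comparing memory footprints (a rough sketch; exact sizes vary by platform and CPython version):

```python
import sys
import numpy as np

n = 1000
arr = np.arange(n, dtype=np.int64)
lst = list(range(n))

# NumPy stores n contiguous 8-byte machine integers
print(arr.nbytes)  # 8000

# a Python list stores n pointers, each to a separate int object
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(list_bytes)  # several times larger
```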

You might also mention Pandas, a popular and powerful library for data analysis/data science. It is preferred by many specialists for its flexible data representation, concise syntax, extensive feature set, and efficient handling of large datasets.

RomanPerekhrest
2

Another option, besides the examples @user2357112 and @NPE already showed, would be to use Numba (a JIT compiler). Pure interpreted Python loops are very slow and should be avoided where performance matters.

Example

import numpy as np
import numba as nb
import numexpr as ne

@nb.njit(parallel=True)
def func_1(num):
    result = np.empty((num, num),dtype=np.int32)
    for i in nb.prange(result.shape[0]):
        for j in range(result.shape[1]):
            result[i, j] = (i * j) % 10
    return result

Timings

#The first call has a higher overhead due to compilation
#parallel: @nb.njit(parallel=True)
%timeit res=func_1(5000)
#20.7 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#single threaded: @nb.njit(parallel=False)
%timeit res=func_1(5000)
#71.9 ms ± 521 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#NPE
%%timeit
v = np.arange(5000, dtype=np.int32)
vt = v[np.newaxis].T
ne.evaluate('v * vt % 10')
#35.5 ms ± 863 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
max9111
  • Numba changes what Python is about. Apparently Python wasn't designed with speed in mind. – Van Aug 31 '23 at 09:44
0

The runtime dropped roughly in half when I replaced the NumPy array with a plain two-dimensional Python list:

import time

start = time.time()
#result = numpy.empty([5000, 5000])
w, h = 5000, 5000
result = [[0 for x in range(w)] for y in range(h)]
for i in range(5000):
    for j in range(5000):
        result[i][j] = (i * j) % 10
end = time.time()
print(end - start) # 4.4 s
-2

Python is very popular for AI for many reasons:
- easy to prototype
- lots of ML libraries and a big community
- uses the GPU to do massively parallel computation on tensors, with CUDA for example

For your problem, try using native Python lists (you are using NumPy, which is probably heavier when filled element by element in a Python loop).