
I noticed that summing an array of numbers is faster with pyarrow than with NumPy. Why is that?

Input:

import timeit

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

np_days = np.random.randint(0, 100, 100000000, dtype=np.int8)
np_months = np.random.randint(0, 100, 100000000, dtype=np.int8)
np_years = np.random.randint(0, 100, 100000000, dtype=np.int8)

days = pa.array(np_days, type=pa.int8())
months = pa.array(np_months, type=pa.int8())
years = pa.array(np_years, type=pa.int8())

Computation using NumPy:

%%time
for i in range(100):
    np.sum(np_days)
    np.sum(np_months)
    np.sum(np_years)

CPU times: user 12.1 s, sys: 0 ns, total: 12.1 s
Wall time: 12.1 s

Computation using pyarrow:

%%time
for i in range(100):
    pc.sum(days)
    pc.sum(months)
    pc.sum(years)

CPU times: user 4.51 s, sys: 0 ns, total: 4.51 s
Wall time: 4.51 s

print('pyarrow:', timeit.timeit(lambda: pc.sum(days), number=1000))

pyarrow: 0.02044958801707253

print('NumPy:', timeit.timeit(lambda: np.sum(np_days), number=1000))

NumPy: 0.04707089299336076


I know that pyarrow and NumPy have different purposes, but for scenarios in which they both offer a solution, can someone also explain for which types of computation I should prefer pyarrow over NumPy?

Thanks

  • Some additional information about the closing and a quick summary: older NumPy versions were affected by inefficient code (i.e. no SIMD); newer ones should not be anymore. That being said, they are still affected by the conversion, which is now memory bound. Pyarrow should not need this conversion step, so it should be faster because of that. You can play with the output dtype to speed up the NumPy version, as described in the provided answers. – Jérôme Richard Oct 19 '22 at 11:16
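
As a minimal sketch of the output-dtype experiment the comment mentions, one could time np.sum with different accumulator dtypes; the np.int32 alternative below is only an illustration (with this data the int32 sum would actually overflow), and whether it helps at all depends on the NumPy version and CPU:

import timeit

import numpy as np

np_days = np.random.randint(0, 100, 100000000, dtype=np.int8)

# np.sum on an int8 array accumulates into the default platform integer
# (usually int64), which means the int8 values are widened as they are summed;
# the dtype argument lets you try a different accumulator. Note: summing 100M
# values in [0, 100) overflows int32, so that timing is a speed illustration only.
for acc in (np.int64, np.int32):
    t = timeit.timeit(lambda: np.sum(np_days, dtype=acc), number=100)
    print(acc.__name__, f'{t:.3f} s')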
