I noticed that summing an array of numbers is faster with pyarrow than with NumPy. Why?
Input:
import timeit
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
np_days = np.random.randint(0, 100, 100000000, dtype=np.int8)
np_months = np.random.randint(0, 100, 100000000, dtype=np.int8)
np_years = np.random.randint(0, 100, 100000000, dtype=np.int8)
days = pa.array(np_days, type=pa.int8())
months = pa.array(np_months, type=pa.int8())
years = pa.array(np_years, type=pa.int8())
Computation using NumPy:
%%time
for i in range(100):
    np.sum(np_days)
    np.sum(np_months)
    np.sum(np_years)
CPU times: user 12.1 s, sys: 0 ns, total: 12.1 s
Wall time: 12.1 s
Computation using pyarrow:
%%time
for i in range(100):
    pc.sum(days)
    pc.sum(months)
    pc.sum(years)
CPU times: user 4.51 s, sys: 0 ns, total: 4.51 s
Wall time: 4.51 s
print('pyarrow:', timeit.timeit(lambda: pc.sum(days), number=1000))
pyarrow: 0.02044958801707253
print('NumPy:', timeit.timeit(lambda: np.sum(np_days), number=1000))
NumPy: 0.04707089299336076
I know that pyarrow and NumPy serve different purposes, but in scenarios where both offer a solution, can someone also explain for which types of computation I should prefer pyarrow over NumPy?
Thanks