
I am trying to use Cython to speed up a Pandas DataFrame computation which is relatively simple: for each row in the DataFrame, add that row to itself and to all remaining rows, sum each of the resulting rows, and collect those sums in a list. The lists get shorter as the rows in the DataFrame are exhausted, and they are stored in a dictionary keyed on the index row number.

def foo(df):
    vals = {i: (df.iloc[i, :] + df.iloc[i:, :]).sum(axis=1).values.tolist()
            for i in range(df.shape[0])}   
    return vals

Aside from adding %%cython at the top of this function, does anyone have a recommendation on how I'd go about using cdefs to convert the DataFrame values to doubles and then cythonize this code?

Below is some dummy data:

>>> df

          A         B         C         D         E
0 -0.326403  1.173797  1.667856 -1.087655  0.427145
1 -0.797344  0.004362  1.499460  0.427453 -0.184672
2 -1.764609  1.949906 -0.968558  0.407954  0.533869
3  0.944205  0.158495 -1.049090 -0.897253  1.236081
4 -2.086274  0.112697  0.934638 -1.337545  0.248608
5 -0.356551 -1.275442  0.701503  1.073797 -0.008074
6 -1.300254  1.474991  0.206862 -0.859361  0.115754
7 -1.078605  0.157739  0.810672  0.468333 -0.851664
8  0.900971  0.021618  0.173563 -0.562580 -2.087487
9  2.155471 -0.605067  0.091478  0.242371  0.290887

and expected output:

>>> foo(df)

{0: [3.7094795101205236,
  2.8039983729106,
  2.013301815968468,
  2.24717712931852,
  -0.27313665495940964,
  1.9899718844711711,
  1.4927321304935717,
  1.3612155622947018,
  0.3008239883773878,
  4.029880107986906],

. . .

 6: [-0.72401524913338,
  -0.8555318173322499,
  -1.9159233912495635,
  1.813132728359954],
 7: [-0.9870483855311194, -2.047439959448434, 1.6816161601610844],
 8: [-3.107831533365748, 0.6212245862437702],
 9: [4.350280705853288]}
    My feeling is that you won't gain a huge amount - most of the work is either in the (vectorised, float+array) addition, or in the sum. Both of these would remain as is in Cython. You could get a (non-Cython based) speed-up by doing the `sum(axis=1)` once outside the loop (see the sketch after these comments). – DavidW May 16 '15 at 09:10
    You can't directly work with dataframes/series in Cython; you will have to work with the underlying numpy array. See here for a tutorial: http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html – joris May 16 '15 at 09:52
    Instead of moving to pure numpy, you can consider using xarray (https://xarray.pydata.org), which is part of the pydata.org initiative. It is closer to numpy but gives basic functionalities of a pandas dataframe. xarray can work with Dask to further speed up calculations via parallel computing. – izkeros Aug 09 '19 at 09:42
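
As DavidW's comment suggests, there is a speed-up available here before reaching for Cython at all: addition commutes with the row sum, so (arr[i, :] + arr[i:, :]).sum(axis=1) equals arr[i, :].sum() + arr[i:, :].sum(axis=1), which means the row sums only need to be computed once. A minimal sketch of that idea (rowsum_foo is a hypothetical name, not from the original post):

def rowsum_foo(arr):
    # Sum each row once up front, then add scalars to a 1-D slice
    # instead of adding and re-summing whole rows every iteration.
    rowsums = arr.sum(axis=1)
    return {i: (rowsums[i] + rowsums[i:]).tolist()
            for i in range(arr.shape[0])}

Called as rowsum_foo(df.values), this matches foo(df) up to floating-point rounding while doing constant work per pair of rows instead of work proportional to the number of columns.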

1 Answer


If you're just trying to do it faster, and not specifically set on using Cython, I'd just do it in plain numpy (about 50x faster).

def numpy_foo(arr):
    vals = {i: (arr[i, :] + arr[i:, :]).sum(axis=1).tolist()
            for i in range(arr.shape[0])}   
    return vals

%timeit foo(df)
100 loops, best of 3: 7.2 ms per loop

%timeit numpy_foo(df.values)
10000 loops, best of 3: 144 µs per loop

foo(df) == numpy_foo(df.values)
Out[586]: True

Generally speaking, pandas gives you a lot of conveniences relative to numpy, but there are overhead costs. So in situations where pandas isn't really adding anything, you can generally speed things up by doing it in numpy. For another example, see this question I asked, which showed a roughly comparable speed difference (about 23x).
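
That said, since the question asked specifically about cdefs: a minimal Cython sketch of the same computation might look like the following. This is an illustration, not a tested benchmark; cython_foo is a hypothetical name, and it assumes the %%cython cell magic mentioned in the question is available and that the input is a float64 array such as df.values.

%%cython
def cython_foo(double[:, :] arr):
    # Typed memoryview plus C-level loops: add row i to every row j >= i
    # and accumulate the combined row's sum in a plain C double.
    cdef Py_ssize_t n = arr.shape[0]
    cdef Py_ssize_t m = arr.shape[1]
    cdef Py_ssize_t i, j, k
    cdef double s
    vals = {}
    for i in range(n):
        row = []
        for j in range(i, n):
            s = 0.0
            for k in range(m):
                s += arr[i, k] + arr[j, k]
            row.append(s)
        vals[i] = row
    return vals

Call it as cython_foo(df.values) (cast with df.values.astype('float64') first if the frame isn't already all floats). Whether this beats the numpy version will depend on the shape of the data, since the explicit loops avoid the temporary arrays that arr[i, :] + arr[i:, :] creates but give up numpy's vectorised inner sum.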
