I am trying to use Cython to speed up a relatively simple Pandas DataFrame computation: for each row in the DataFrame, add that row to itself and to each of the remaining rows, sum each resulting row across its columns, and collect these sums in a list. The lists get shorter as the rows of the DataFrame are exhausted. They are stored in a dictionary keyed on the row's index number.
def foo(df):
    vals = {i: (df.iloc[i, :] + df.iloc[i:, :]).sum(axis=1).values.tolist()
            for i in range(df.shape[0])}
    return vals
Aside from adding %%cython at the top of the cell containing this function, does anyone have a recommendation on how I'd go about using cdef declarations to convert the DataFrame values to doubles and then cythonize this code?
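For reference, here is a minimal sketch of the same computation on the underlying float64 array, which is the form a typed Cython version would presumably operate on. It assumes every column is numeric and there are no missing values (pandas' sum skips NaN by default, plain NumPy does not). The name foo_numpy is hypothetical and not part of my original code. It relies on the fact that summing (row i + row j) across columns equals sum(row i) + sum(row j), so the results can differ from foo(df) by floating-point rounding.

```python
import numpy as np
import pandas as pd


def foo_numpy(df):
    """Hypothetical array-based equivalent of foo(df).

    Assumes all columns are numeric and contain no missing values.
    """
    # Contiguous float64 array: the kind of buffer a typed Cython function
    # could accept as a `double[:, ::1]` memoryview.
    arr = np.ascontiguousarray(df.to_numpy(dtype=np.float64))

    # sum(row_i + row_j) == sum(row_i) + sum(row_j), so the per-row sums
    # only need to be computed once.
    row_sums = arr.sum(axis=1)

    # Entry i pairs row i with itself and with every later row.
    return {i: (row_sums[i] + row_sums[i:]).tolist()
            for i in range(arr.shape[0])}
```

Calling foo_numpy(df) should give the same dictionary as foo(df) up to rounding, so comparing the two with np.allclose is safer than comparing with ==.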
Below is some dummy data:
>>> df
          A         B         C         D         E
0 -0.326403  1.173797  1.667856 -1.087655  0.427145
1 -0.797344  0.004362  1.499460  0.427453 -0.184672
2 -1.764609  1.949906 -0.968558  0.407954  0.533869
3  0.944205  0.158495 -1.049090 -0.897253  1.236081
4 -2.086274  0.112697  0.934638 -1.337545  0.248608
5 -0.356551 -1.275442  0.701503  1.073797 -0.008074
6 -1.300254  1.474991  0.206862 -0.859361  0.115754
7 -1.078605  0.157739  0.810672  0.468333 -0.851664
8  0.900971  0.021618  0.173563 -0.562580 -2.087487
9  2.155471 -0.605067  0.091478  0.242371  0.290887
And the expected output:
>>> foo(df)
{0: [3.7094795101205236,
2.8039983729106,
2.013301815968468,
2.24717712931852,
-0.27313665495940964,
1.9899718844711711,
1.4927321304935717,
1.3612155622947018,
0.3008239883773878,
4.029880107986906],
. . .
6: [-0.72401524913338,
-0.8555318173322499,
-1.9159233912495635,
1.813132728359954],
7: [-0.9870483855311194, -2.047439959448434, 1.6816161601610844],
8: [-3.107831533365748, 0.6212245862437702],
9: [4.350280705853288]}