2

Consider the following solution to computing a within-group diff in Pandas:

df =  df.set_index(['ticker', 'date']).sort_index()[['value']]
df['diff'] = np.nan
idx = pd.IndexSlice

for ix in df.index.levels[0]:
    df.loc[ idx[ix,:], 'diff'] = df.loc[idx[ix,:], 'value' ].diff()

For:

> df
   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

It returns:

> df
             value  diff
ticker date             
A      22     0.55   NaN
       32    -2.50 -3.05
       34    -0.23  2.27
       76    -0.79 -0.56
B      59    -1.01   NaN
       72    -1.24 -0.23
       92     2.43  3.67
C      22    -1.29   NaN
       63     1.65  2.94
       88    -1.93 -3.58

The solution does not scale well for large dataframes. It takes minutes for a dataframe with a shape (405344,2). This is presumably the case because I am iterating through each value for the first level in the main loop.

Is there any way of speeding this up in Pandas? Is looping through index values a good way of solving this problem? Could numba perhaps be used for this?

tupui
  • 5,738
  • 3
  • 31
  • 52
Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564

3 Answers3

4

Here's another way, which ought to be a lot faster.

First, sort based on ticker and date:

In [11]: df = df.set_index(['ticker', 'date']).sort_index()

In [12]: df
Out[12]:
             value
ticker date
A      22     0.55
       32    -2.50
       34    -0.23
       76    -0.79
B      59    -1.01
       72    -1.24
       92     2.43
C      22    -1.29
       63     1.65
       88    -1.93

Add the diff column:

In [13]: df['diff'] = df['value'].diff()

To fill in the NaNs, we can find the first line as follows (there may be a nicer way):

In [14]: s = pd.Series(df.index.labels[0])

In [15]: s != s.shift()
Out[15]:
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7     True
8    False
9    False
dtype: bool

In [16]: df.loc[(s != s.shift()).values 'diff'] = np.nan

In [17]: df
Out[17]:
             value  diff
ticker date
A      22     0.55   NaN
       32    -2.50 -3.05
       34    -0.23  2.27
       76    -0.79 -0.56
B      59    -1.01   NaN
       72    -1.24 -0.23
       92     2.43  3.67
C      22    -1.29   NaN
       63     1.65  2.94
       88    -1.93 -3.58
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
2

Using groupby/apply is simple and elegant, but it can be slow in Pandas. Bodo JIT compiler (based on Numba) can make it fast in many cases:

pip install bodo
import pandas as pd
import numpy as np
import bodo

def value_and_diff(subdf):
    subdf = subdf.set_index('date').sort_index()
    return pd.DataFrame({'value': subdf['value'],
                        'diff': subdf['value'].diff()})

@bodo.jit(distributed=False)
def f(df):
    df2 = df.groupby('ticker').apply(value_and_diff)
    return df2

np.random.seed(0)
df = pd.DataFrame({'ticker': ["A", "B", "C", "D"] * 25_000,
  'date': pd.date_range('1/1/2000', periods=100_000, freq='T'),
  'value': np.random.randn(100_000)})
print(f(df))
Ehsan
  • 86
  • 3
1

As an alternative, you could do the sorting and indexing within each group. Though not time-tested yet:

In [11]: def value_and_diff(subdf):
             subdf = subdf.set_index('date').sort_index()
             return pd.DataFrame({'value': subdf['value'],
                                  'diff': subdf['value'].diff()})

In [12]: df.groupby('ticker').apply(value_and_diff)
Out[12]:
             diff  value
ticker date
A      22     NaN   0.55
       32   -3.05  -2.50
       34    2.27  -0.23
       76   -0.56  -0.79
B      59     NaN  -1.01
       72   -0.23  -1.24
       92    3.67   2.43
C      22     NaN  -1.29
       63    2.94   1.65
       88   -3.58  -1.93
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Thanks @Andy. It's interesting that you sort entries within apply (e.g. as opposed to sorting them *before* running groupby and apply). Is this because `groupby` is not guaranteed to preserve the original ordering? – Amelio Vazquez-Reina Feb 10 '15 at 01:07
  • Also, looking at [this answer](http://stackoverflow.com/a/20671047/2832960) from Jeff, I see that he applies `transform(Series.diff)` instead of just `diff` as in your code. Do you know when to use one vs the other for within-group differencing? – Amelio Vazquez-Reina Feb 10 '15 at 01:08
  • 1
    @AmelioVazquez-Reina in situations like this (when the function doesn't "reduce") then transform and apply are the same. In retrospect, I think that sorting globally may be faster... I mistakenly thought that was the issue causing the most slow down. I think I have a better solution. – Andy Hayden Feb 10 '15 at 01:10