Computing np.diff in Pandas after using groupby leads to unexpected result

Question

I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?

Here is some data that has the right structure (code modeled on an answer here):

import pandas as pd
import numpy as np
import random
from itertools import product

random.seed(1)       # so you can play along at home
np.random.seed(2)    # ditto

# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).to_native_types()
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))

mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})

Ok, great. This gives me a frame like so:

     date        ticker      value
0    2013-10-03  ticker_2    0.435995
1    2013-10-04  ticker_2    0.025926
2    2013-10-02  ticker_1    0.549662
3    2013-10-01  ticker_0    0.435322
4    2013-10-02  ticker_2    0.420368
5    2013-10-03  ticker_0    0.330335
6    2013-10-04  ticker_1    0.204649
7    2013-10-02  ticker_0    0.619271
8    2013-10-01  ticker_2    0.299655

My goal is to add a new column to this dataframe that will contain sequential changes. The data needs to be in order to do this, but the ordering and the differencing needs to be done "ticker-wise" so that gaps in another ticker don't cause NA's for a given ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on what was necessary to do the differencing). The following code works:

data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort(['ticker', 'date'], inplace=True)
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1

and returns:

     date        ticker      value       diffs
0    2013-10-03  ticker_2    0.435995    0.015627
1    2013-10-04  ticker_2    0.025926   -0.410069
2    2013-10-02  ticker_1    0.549662    NaN
3    2013-10-01  ticker_0    0.435322    NaN
4    2013-10-02  ticker_2    0.420368    0.120713
5    2013-10-03  ticker_0    0.330335   -0.288936
6    2013-10-04  ticker_1    0.204649   -0.345014
7    2013-10-02  ticker_0    0.619271    0.183949
8    2013-10-01  ticker_2    0.299655    NaN

So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:

data2 = data.copy()
data2.sort(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2

A quick check shows that, in fact, data1 is equal to data2. However, if I do this:

data3 = data.copy()
data3.sort(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3

I get a strange result:

     date        ticker     value       diffs
0    2013-10-03  ticker_2    0.435995    0
1    2013-10-04  ticker_2    0.025926   NaN
2    2013-10-02  ticker_1    0.549662   NaN
3    2013-10-01  ticker_0    0.435322   NaN
4    2013-10-02  ticker_2    0.420368   NaN
5    2013-10-03  ticker_0    0.330335    0
6    2013-10-04  ticker_1    0.204649   NaN
7    2013-10-02  ticker_0    0.619271   NaN
8    2013-10-01  ticker_2    0.299655    0

What's going on here? When you call the .diff method on a Pandas object, is it not just calling np.diff? I know there's a diff method on the DataFrame class, but I couldn't figure out how to pass that to transform without the lambda function syntax I used to make data1 work. Am I missing something? Why is the diffs column in data3 screwy? How can I have call the Pandas diff method within transform without needing to write a lambda to do it?

Add the `sort=False` flag to .groupby() please. You're stressing me out. For some reason .groupby with .diff uses heaps of memory and it's all rather inefficient and throwing an unnecessary sort in there just makes it worse. — grofte, Dec 08 '17 at 23:42
Your question is about `np.diff` rather than `pandas.diff`. Edited your title for you. — smci, Nov 14 '19 at 00:31

Jeff · Accepted Answer · 2013-12-19T00:08:54.960

35

Nice easy to reproduce example!! more questions should be like this!

Just pass a lambda to transform (this is tantamount to passing afuncton object, e.g. np.diff (or Series.diff) directly. So this equivalent to data1/data2

In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(Series.diff)

In [34]: data3.sort_index(inplace=True)

In [25]: data3
Out[25]: 
         date    ticker     value     diffs
0  2013-10-03  ticker_2  0.435995  0.015627
1  2013-10-04  ticker_2  0.025926 -0.410069
2  2013-10-02  ticker_1  0.549662       NaN
3  2013-10-01  ticker_0  0.435322       NaN
4  2013-10-02  ticker_2  0.420368  0.120713
5  2013-10-03  ticker_0  0.330335 -0.288936
6  2013-10-04  ticker_1  0.204649 -0.345014
7  2013-10-02  ticker_0  0.619271  0.183949
8  2013-10-01  ticker_2  0.299655       NaN

[9 rows x 4 columns]

I believe that np.diff doesn't follow numpy's own unfunc guidelines to process array inputs (whereby it tries various methods to coerce input and send output, e.g. __array__ on input __array_wrap__ on output). I am not really sure why, see a bit more info here. So bottom line is that np.diff is not dealing with the index properly and doing its own calculation (which in this case is wrong).

Pandas has a lot of methods where they don't just call the numpy function, mainly because they handle different dtypes, handle nans, and in this case, handle 'special' diffs. e.g. you can pass a time frequency to a datelike-index where it calculates how many n to actually diff.

edited Dec 19 '13 at 00:08

answered Dec 19 '13 at 00:00

Jeff

125,376
21
220
187

This is weird behaviour! – Andy Hayden Dec 19 '13 at 00:05
yep....it might be a bug or maybe its just a numpy weirdism (IOW their may be a reason it is this way). – Jeff Dec 19 '13 at 00:06
Thanks for complement. I'm new at SO, and trying to make it easy for folks like you to answer my less-than-deep questions! Did you skip my `sort(['ticker', 'date'], inplace=True)` step? Seems like your answer is different from mine as a result. Not a big deal, I think I get the gist of what you're saying. Any feelings about style/stability/extensibility best practices that would lead you preferring either the `transform(lambda x: x.diff())` method or the `blahblah.diff()` method to accomplish this task? – 8one6 Dec 19 '13 at 00:13
@Jeff don't forget to upvote this excellent question, I also marvel it is excellence. – Andy Hayden Dec 19 '13 at 00:13
@DJ_8one6 I HAD not pasted the correct data. updated now. Using a ``lambda`` expression is pretty standard; gets the meaning across. You can also of course use a named function. I think youre data2 method is just fine. Clear and concise. – Jeff Dec 19 '13 at 00:17
Too kind. @Jeff, when I first read your answer, I totally missed that you'd put `Series` into your first line. I kept trying `pd.DataFrame.diff` and having it blow chunks. Works perfectly with `pd.Series.diff`. Can you recommend a place I could look to see which methods (`diff`, `sum`, whatever) properly belong to `Series` vs `DataFrame` vs some other structure? One challenge I've had with Pandas so far is trying to peel back the "magic" that makes it work in 90% of cases to figure out why it's not working in one of my 10% examples. – 8one6 Dec 19 '13 at 00:17
1

@DJ_8one6 hah, I had changed it; I realized you already had the lambda solution up. A groupby like you are doing it (e.g. with ``data3.groupby('ticker')['value']`` will pass a Series into the applied function; if you did ``data3.groupby('ticker').apply(...)`` you would get a DataFrame. That's why a lambda is nice you don't have to exiplicity reference the function as its anonymous. – Jeff Dec 19 '13 at 00:19
2

When I create a groupby I somethins do this: ``data3.groupby(...).apply(f)`` where ``f`` is : ``def f(x): print x; return x``. To see exactly what is passing in. – Jeff Dec 19 '13 at 00:20
@Jeff that is outrageous !! lmao (I also do the print trick, or append to a list, it works great... although sometimes changes depending on the return type!) – Andy Hayden Dec 19 '13 at 00:22
Say my dataframe had two values columns: value_1 and value_2. I can do: `diffs_df = data3.groupby(['ticker'])[['value_1', 'value_2']].transform(lambda x: x.diff())` and that works fine. But `diffs_df = data3.groupby(['ticker'])[['value_1', 'value_2']].transform(pd.DataFrame.diff)` blows chunks. So is it correct to say that `transform` always operates on a series (even if it means it has to work on multiple series in succession), while `apply` works on multiple series all at once as a DataFrame? – 8one6 Dec 19 '13 at 00:33
*always* is a strong word. The way you are selecting things makes a difference, e.g. I think that ``data3.groupby('ticker').[['value_1']]`` will give you a DataFrame (of a single-column); also if you have multi-grouping keys all bets are off – Jeff Dec 19 '13 at 00:37
You basically headed me off at the pass here, but my follow up was going to be: What determines the type of object passed to the function...is it the # of columns, or is it `transform` vs `apply`, or is it `['colname']` vs `[['colname']]`? – 8one6 Dec 19 '13 at 00:38
I know I am late to this party, but is there any to know that this is running a `diff` in order? The code you have Jeff does not pre-sort value based on `date`. Is one supposed to do so? Are there any guarantees that `groupby` preserves the order between rows? – Amelio Vazquez-Reina Feb 10 '15 at 00:50
see line 34 it does pre sort – Jeff Feb 10 '15 at 11:43

score 7 · Answer 2 · answered Dec 19 '13 at 00:01

7

You can see that the Series .diff() method is different to np.diff():

In [11]: data.value.diff()  # Note the NaN
Out[11]: 
0         NaN
1   -0.410069
2    0.523736
3   -0.114340
4   -0.014955
5   -0.090033
6   -0.125686
7    0.414622
8   -0.319616
Name: value, dtype: float64

In [12]: np.diff(data.value.values)  # the values array of the column
Out[12]: 
array([-0.41006867,  0.52373625, -0.11434009, -0.01495459, -0.09003298,
       -0.12568619,  0.41462233, -0.31961629])

In [13]: np.diff(data.value) # on the column (Series)
Out[13]: 
0   NaN
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8   NaN
Name: value, dtype: float64

In [14]: np.diff(data.value.index)  # er... on the index
Out[14]: Int64Index([8], dtype=int64)

In [15]: np.diff(data.value.index.values)
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1])

answered Dec 19 '13 at 00:01

Andy Hayden

359,921
101
625
535

1

Got it. Your example makes that quite clear. In my original post, I showed two methods that worked (1: involved `lambda` and 2: involved just using the `diff` method directly on a Pandas `Series`). Which of these do you think is more appropriate/best practice? In general, do you try to use as few `lambda`'s as possible, or do you find they make it easier to see the structure of what's going on? – 8one6 Dec 19 '13 at 00:09
2

@DJ_8one6 actually I find lambdas surprisingly efficient in pandas (give the timing a try it's pretty negligible, though pd.Series.diff I think reads nicer)! Atm groupby.diff() is actually doing groupby.apply(pd.Series.diff), and that's usually appears to be slower. Maybe in 0.14 .diff could be made a lot faster, one of many low-hanging fruit / groupby methods. – Andy Hayden Dec 19 '13 at 00:20

Computing np.diff in Pandas after using groupby leads to unexpected result

2 Answers2

Linked

Related