I am trying to duplicate this result from R in Python. The function I want to apply (np.diff) takes an input and returns an array of the same size. When I try to group I get an output the size of the number of groups, not the number of rows.

Example DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'sample': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 1, 3, 2, 4, 3]})

If I apply diff to the whole column I get close to the result I want, except at the group borders. The -4 in row 5, where sample 2 starts, is the problem: the diff should restart for each group.

x = np.diff([df.loc[:,'value']], 1, prepend=0)[0]
df.loc[:,'delta'] = x
   sample  value  delta
0       1      1      1
1       1      2      1
2       1      3      1
3       1      4      1
4       1      5      1
5       2      1     -4
6       2      3      2
7       2      2     -1
8       2      4      2
9       2      3     -1

I think the answer is to use groupby and apply or transform but I cannot figure out the syntax. The closest I can get is:

df.groupby('sample').apply(lambda g: np.diff(g['value'], 1, prepend=0))

x
1      [1, 1, 1, 1, 1]
2    [1, 2, -1, 2, -1]
MikeF

1 Answer

You can use DataFrameGroupBy.diff, replace the missing first value of each group with 1, and then convert the values to integers:

df['delta'] = df.groupby('sample')['value'].diff().fillna(1).astype(int)
print(df)
   sample  value  delta
0       1      1      1
1       1      2      1
2       1      3      1
3       1      4      1
4       1      5      1
5       2      1      1
6       2      3      2
7       2      2     -1
8       2      4      2
9       2      3     -1
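
Note that fillna(1) reproduces np.diff(..., prepend=0) here only because every group happens to start with the value 1. With prepend=0 the first difference in a group is simply the group's first value, so a more general variant (a sketch, not part of the original answer) fills the gaps from the value column itself:

```python
import pandas as pd

df = pd.DataFrame({'sample': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 1, 3, 2, 4, 3]})

# Per-group difference; the first row of each group is NaN
group_diff = df.groupby('sample')['value'].diff()

# With prepend=0 the first difference equals the first value itself,
# so fill each NaN from the same row of 'value' (aligned on the index)
df['delta'] = group_diff.fillna(df['value']).astype(int)
print(df)
```

For the sample data this gives the same delta column as above; the two versions only differ when a group starts with a value other than 1.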

Your own solution also works if you switch to GroupBy.transform: select the column to process right after groupby, and drop the column selection inside the lambda function:

df['delta'] = df.groupby('sample')['value'].transform(lambda x: np.diff(x, 1, prepend = 0))
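
To see why transform fits here and apply does not, the following sketch runs the same lambda through both on the sample data (the names grouped, per_group and per_row are only for illustration). apply returns one object per group, while transform returns one value per row, aligned to the original index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sample': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 1, 3, 2, 4, 3]})

grouped = df.groupby('sample')['value']

# apply: the lambda's array is stored as a single element per group
per_group = grouped.apply(lambda s: np.diff(s, 1, prepend=0))

# transform: the lambda's array is spread back over the group's rows
per_row = grouped.transform(lambda s: np.diff(s, 1, prepend=0))

print(len(per_group), len(per_row))
df['delta'] = per_row
```

Because per_row shares the index of df, it can be assigned directly as a new column.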
jezrael
  • Note for the `DataFrame.diff()` I had to add a `.astype(int)` to coerce back to int value. – MikeF Apr 08 '20 at 12:39