0

I want to create a new column in a pandas dataframe based on values found on a previous row.

Specifically I want to add a column with the difference, in days, between the date found on the actual row and the date found on the last, among previous rows, with the same userId and amount > 0.

I have this:

+--------+------------+-----------+
| UserId |    Date    |    Amount |
+--------+------------+-----------+
|      1 | 2017-01-01 |         0 |
|      1 | 2017-01-03 |        10 |
|      2 | 2017-01-04 |        20 |
|      2 | 2017-01-07 |        15 |
|      1 | 2017-01-09 |         7 |
+--------+------------+-----------+

And I want this

+--------+------------+-----------+-------------+
| UserId |    Date    |    Amount |  Difference |
+--------+------------+-----------+-------------+
|      1 | 2017-01-01 |         0 |          -1 |
|      1 | 2017-01-03 |        10 |          -1 |
|      2 | 2017-01-04 |        20 |          -1 |
|      2 | 2017-01-07 |        15 |           3 |
|      1 | 2017-01-09 |         7 |           6 |
+--------+------------+-----------+-------------+
  • What have you tried so far? You can use `pd.timedelta` and `pd.shift` – G. Anderson Nov 13 '18 at 17:55
  • I have tried `df['difference'] = df.groupby(['UserId']).filter(lambda x: (x['Amount'] > 0).any())['Date'].diff().fillna(-1)` but this way only filtered columns are modified – Marco Nardelli Nov 13 '18 at 18:24
  • Possibly `groupby('UserID').max('Date')` to get the last date value for each ID, then do a `timedelta` to pass that value back into the original dataframe – G. Anderson Nov 13 '18 at 18:34

2 Answers2

0

You were really close; I just modified your code a bit.

"""
UserId     Date        Amount 
1  2017-01-01          0 
1  2017-01-03         10 
2  2017-01-04         20 
2  2017-01-07         15 
1  2017-01-09          7 
"""
import pandas as pd
df = pd.read_clipboard(parse_dates=["Date"])

df['difference'] = df[df['Amount'] > 0].groupby(['UserId'])['Date'].diff().dt.days.fillna(-1)
df.loc[0, "difference"] = -1
df

Output:

   UserId       Date  Amount  difference
0       1 2017-01-01       0        -1.0
1       1 2017-01-03      10        -1.0
2       2 2017-01-04      20        -1.0
3       2 2017-01-07      15         3.0
4       1 2017-01-09       7         6.0

Help from: Python: Convert timedelta to int in a dataframe

Obviously, I manually changed the first row; how does the rest of your df shake out when using this code?

Evan
  • 2,121
  • 14
  • 27
  • Still something is not working tho, in the first one there is NaT and in the secondo one there is -1 days, I would like to have a 0 time difference on both, another problem is that I'm not sure is taking the most recent row... – Marco Nardelli Nov 13 '18 at 21:13
  • Updated to reflect your comments. – Evan Nov 13 '18 at 22:30
  • Still not working, is not taking the last entry but the first one to calculate the difference... – Marco Nardelli Nov 15 '18 at 10:50
  • I'm not sure what you mean; can you post a larger (15-20 rows) sample dataset with your desired output? – Evan Nov 15 '18 at 15:17
0

Just another way around taking your approach into the consideration:

First convert your Date column to datetime using pandas function to_datetime .

df['Date'] = pd.to_datetime(df['Date'])

Now take the difference by days using groupby, which will show the diffrence and rest values will be resulted as NaN

df['Difference'] = df[df['Amount'] > 0].groupby(['UserId'])['Date'].diff().dt.days

df
   UserId       Date  Amount  Difference
0       1 2017-01-01       0         NaN
1       1 2017-01-03      10         NaN
2       2 2017-01-04      20         NaN
3       2 2017-01-07      15         3.0
4       2 2017-01-09       7         2.0

Now, Lastly Fill all the NaN's to -1 in the DataFrames Difference column.

df['Difference'] = df['Difference'].fillna("-1")
# df = df.fillna("-1") <-- this do the Job but in case you have NaNs in other location in df it will also replace them as `-1`

Result:

df
   UserId       Date  Amount Difference
0       1 2017-01-01       0         -1
1       1 2017-01-03      10         -1
2       2 2017-01-04      20         -1
3       2 2017-01-07      15          3
4       2 2017-01-09       7          2
Karn Kumar
  • 8,518
  • 3
  • 27
  • 53
  • There is still a problem, this way is not going to take the last entry, in fact the last userid of your example has been changed to 2 but it should be 1 – Marco Nardelli Nov 15 '18 at 10:49