1

I'm having a pandas issue.

I have a dataframe that looks like the following:

name     date
Mark     2018-01-01
Anne     2018-01-01
Anne     2018-02-01
Anne     2018-04-01
Anne     2018-09-01
Anne     2019-01-01    
John     2018-02-01
John     2018-06-01
John     2019-02-01
Ethan    2018-03-01

And I need to compute an additional column, call it months, that contains the number of months between each consecutive row, for the same name. This is in order to calculate how many months have gone by between one appereance of a name and the next one in the DataFrame. The calculation must be done grouping by name, as I only want to know the number of months between consecutive appereances for one user, not for different names.

The expected output in this case would be:

name     date          months
Mark     2018-01-01    0
Anne     2018-01-01    0
Anne     2018-02-01    1
Anne     2018-04-01    2
Anne     2018-09-01    5
Anne     2019-01-01    4
John     2018-02-01    0
John     2018-06-01    4
John     2019-02-01    8
Ethan    2018-03-01    0

Any tips on what is the most efficient way of computing the months column will be highly appreciated.

Note that The dates always have the first day of the month, which makes the calculation much easier.

HRDSL
  • 711
  • 1
  • 5
  • 22
  • Nope, look at the output dataframe i'm looking for. It is similar, but the calculation I want here is different. You gave me the right answer for the other one, maybe you could help to get the answer for this one? Thank you in advance! – HRDSL Aug 19 '19 at 11:39

3 Answers3

2

Use GroupBy.diff and divide by a 1 month timedelta.

df['months'] = df.groupby('name')['date'].diff().div(pd.Timedelta(days=30.44), fill_value=0).round().astype(int)

output

    name       date  months
0   Mark 2018-01-01       0
1   Anne 2018-01-01       0
2   Anne 2018-02-01       1
3   Anne 2018-04-01       2
4   Anne 2018-09-01       5
5   Anne 2019-01-01       4
6   John 2018-02-01       0
7   John 2018-06-01       4
8   John 2019-02-01       8
9  Ethan 2018-03-01       0
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • 1
    This seems to work! I get this warning though: Passing integers to fillna is deprecated, will raise a TypeError in a future version. To retain the old behavior, pass pd.Timedelta(seconds=n) instead. – HRDSL Aug 19 '19 at 11:44
  • Yes we can fix this by using `pd.Timedelta(days=30)` instead. Or if you want to be really accurate: `pd.Timedelta(days=30.4)`. Just checked it. The most accurate version would be: `pd.Timedelta(days=30.4368499)` – Erfan Aug 19 '19 at 11:47
1

Convert values to months and then get difference by DataFrameGroupBy.diff:

df['date'] = pd.to_datetime(df['date'])

a = df['date'].dt.year * 12 + df['date'].dt.month - 1
df['months'] = a.groupby(df['name']).diff().fillna(0).astype(int)
print (df)
    name       date  months
0   Mark 2018-01-01       0
1   Anne 2018-01-01       0
2   Anne 2018-02-01       1
3   Anne 2018-04-01       2
4   Anne 2018-09-01       5
5   Anne 2019-01-01       4
6   John 2018-02-01       0
7   John 2018-06-01       4
8   John 2019-02-01       8
9  Ethan 2018-03-01       0

Another solution:

df['date'] = pd.to_datetime(df['date'])

from operator import attrgetter
df['months'] = (df.assign(month = df['date'].dt.to_period('m'))
                  .groupby('name')['month']
                  .diff()
                  .dropna()
                  .apply(attrgetter('n'))
                  .reindex(df.index, fill_value=0))

print (df)
    name       date  months
0   Mark 2018-01-01       0
1   Anne 2018-01-01       0
2   Anne 2018-02-01       1
3   Anne 2018-04-01       2
4   Anne 2018-09-01       5
5   Anne 2019-01-01       4
6   John 2018-02-01       0
7   John 2018-06-01       4
8   John 2019-02-01       8
9  Ethan 2018-03-01       0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

First extract month from date

df.date  = pd.to_datetime(df.date)
df['Month_from_date'] = pd.DatetimeIndex(df['date']).month

And then caculate difference

df['months'] = df['Month_from_date'] - df['Month_from_date'].shift(1)
Shubh
  • 585
  • 9
  • 29