1

I have a dataframe where i wish to compute the number of days (timedelta) that a unique asset remained installed. The sample input dataframe is as follows

df =pd.DataFrame({'Date': ['2007-11-01','2012-03-02','2012-03-02','2013-04-01','2013-04-01','2017-11-15','2017-11-15'], 'action':['installed','installed','removed','installed','removed','installed','removed'], 'asset_alphnum':['A-3724','A-3534','A-3724','A2732','A-3534','A-2007','A2732']})

Outputs:

enter image description here

I have tried pd.crosstab

pd.crosstab(df.asset_alphnum, [df.Date, df.action])

outputs enter image description here

However I cannot work out how to take it from here. Somehow need to collapse the hierarchical index and subract one date from the next.

Any guidance much appreciated.

Prolle
  • 358
  • 1
  • 10

2 Answers2

1

Assuming you only ever have one install date and one removed date for each ID, you can do it with a pivot table, by using max (or min or any other aggregating function) on the date values.

df["Date"] = pd.to_datetime(df["Date"])  # Converting strings to datetimes
df = df.pivot_table(index="asset_alphnum", columns="action", values="Date", aggfunc=max)  # max is just an example, most aggfuncs will work if there's only ever one date
df["time_installed"] = df["removed"] - df["installed"]  # timedelta column

Which outputs this:

asset_alphnum installed removed time_installed
A-2007 2017-11-15 00:00:00 NaT NaT
A-3534 2012-03-02 00:00:00 2013-04-01 00:00:00 395 days 00:00:00
A-3724 2007-11-01 00:00:00 2012-03-02 00:00:00 1583 days 00:00:00
A2732 2013-04-01 00:00:00 2017-11-15 00:00:00 1689 days 00:00:00
OtherBarry
  • 81
  • 1
  • 6
  • that is fantastic, thank you. I assume passing aggfunc=max would mean taking largest values where there are duplicates of install/remove for the same id? – Prolle Feb 11 '21 at 22:41
  • max would take the largest (i.e. latest) date for both installed and removed, so if you had multiple installed/removed in there it would only give you the time installed for the latest installation. – OtherBarry Feb 11 '21 at 22:44
  • ok thanks. it hasn't worked on my actual df. In reality I have multiple install and multiple remove dates for any ID – Prolle Feb 11 '21 at 22:48
  • You would get some odd values if you have multiple installs and removes for the same id, but not an even number of each. You'd end up with a negative value in time_installed. – OtherBarry Feb 11 '21 at 22:48
  • If you want total install time, you should refer to a previous question of yours: https://stackoverflow.com/questions/62521934/ – OtherBarry Feb 11 '21 at 22:48
1

create two columns installed and remove using an apply. Then use the dataframe cross tab to count frequencies

date=['2007-11-01', '2012-03-02', '2012-03-02',
'2013-04-01', '2013-04-01', '2017-11-15', '2017-11-15']
action=['installed', 'installed', 'removed', 'installed','removed','installed','removed']
asset_alphnum=['A-3724','A3534','A-3724','A2732','A-3534','A-2007','A2732']

df=pd.DataFrame({'date':date, 'action':action,'asset_alphnum':asset_alphnum})
df.set_index('date')
df['installed']=df['action'].apply(lambda x: 1 if x=='installed' else 0)
df['removed']=df['action'].apply(lambda x: 1 if x=='removed' else 0)
df.drop('action',axis=1)
print(df)
print(pd.crosstab(df.asset_alphnum, [df.date]))
output:
date           2007-11-01  2012-03-02  2013-04-01  2017-11-15
asset_alphnum                                                
A-2007                  0           0           0               1
A-3534                  0           0           1           0
A-3724                  1           1           0           0
A2732                   0           0           1           1
A3534                   0           1           0           0
Golden Lion
  • 3,840
  • 2
  • 26
  • 35