How can I pivot a dataframe but keep duplicates as duplicates?

Question

I have a dataframe with some clock in/clock out times that looks like this:

                  Date In/Out       Id
0  2020-11-04 14:25:25     In   912907
1  2020-11-04 14:25:43     In  1111111
2  2020-11-04 14:26:20    Out  1111111
3  2020-11-04 14:26:29    Out   912907
4  2020-11-05 14:25:25     In   912907
5  2020-11-05 14:26:29    Out   912907

I would like to transform it to look like this:

In/Out       Id                   In                  Out
0       1111111  2020-11-04 14:25:43  2020-11-04 14:26:20
1        912907  2020-11-04 14:25:25  2020-11-04 14:26:29
2        912907  2020-11-05 14:25:25  2020-11-05 14:26:29

I've tried pivoting on Id, but the repeated Id is causing a ValueError. How can I accomplish this?

import pandas

df = pandas.DataFrame(data={
    'Date':['2020-11-04 14:25:25','2020-11-04 14:25:43','2020-11-04 14:26:20','2020-11-04 14:26:29','2020-11-05 14:25:25','2020-11-05 14:26:29'],
    'In/Out':['In','In','Out','Out','In','Out'],
    'Id':['912907','1111111','1111111','912907','912907','912907']
})

print(df)

print(df.drop(index=[4,5]).pivot(index='Id',columns='In/Out',values='Date').reset_index())

try:
    print(df.pivot(index='Id',columns='In/Out',values='Date').reset_index())
except ValueError:
    print('ValueError: Index contains duplicate entries, cannot reshape')

2

see Question `10` and `10.1` – Umar.H Nov 06 '20 at 15:39

score 2 · Accepted Answer · answered Nov 06 '20 at 15:36

You can also use groupby().cumcount to enumerate the relative row orders:

# chain with `reset_index` if you want
(df.assign(index=df.sort_values(['Date']).groupby(['Id','In/Out']).cumcount())
   .pivot(index=['Id','index'], columns='In/Out', values="Date")
)

Output:

In/Out                          In                  Out
Id      index                                          
912907  0      2020-11-04 14:25:25  2020-11-04 14:26:29
        1      2020-11-05 14:25:25  2020-11-05 14:26:29
1111111 0      2020-11-04 14:25:43  2020-11-04 14:26:20

nice, i prefer `cumcount` with `unstack` but this feels more explicit. `df.set_index([df.groupby(['Id','In/Out']).cumcount(),'Id','In/Out']).unstack(-1)` — Umar.H, Nov 06 '20 at 15:37

score 0 · Answer 2 · answered Nov 06 '20 at 15:27

One approach is to assign a unique number to each in/out session. We can do this by sorting the visits by person and then time, so that each 'In' is followed by its 'Out'. Then we can take the row number for the 'In', the row number minus 1 for the 'Out', and pivot on that.

df = df.sort_values(by=['Id','Date']).reset_index(drop=True).reset_index()
# now the column named 'index' is equal to the row number
df.loc[ df['In/Out']=='Out' , 'index' ] -= 1

print(df.pivot(index=['index','Id'],columns='In/Out',values='Date').reset_index())

Prints:

In/Out  index       Id                   In                  Out
0           0  1111111  2020-11-04 14:25:43  2020-11-04 14:26:20
1           2   912907  2020-11-04 14:25:25  2020-11-04 14:26:29
2           4   912907  2020-11-05 14:25:25  2020-11-05 14:26:29

How can I pivot a dataframe but keep duplicates as duplicates?

2 Answers2