0

Dear stackoverflow friends, I need your help from senior pandas users for a pretty easy task that I cannot it solve myself.

here's a df with recorded entrance for every person, however, there are multiple ins for people (they get in- check out for lunchbreak- get in again).

The df is sorted ascending for people, date and entrance time.

I need to extract the 1st recorded entrance, thus excluding the others (eg: after lunchbreak). To get to the desired output (1st_stamp) i need to check whether the current row is the first in of the day ( of the same person ofc), then a "y" would appear on column "1st_stamp".

It's tricky because some people have only 1 entrance (eg: Person N.3), some have 2 (Person N.2), and some split their lunch in two breaks so they have 3 in-recorded entrances (Person N. 7).

How would you go on solving this riddle?

Ps:being able to clean this data is of enormous use for staff planning processes. Thank you dears :)

+-------------+------------+------------------+----------+-----------+
|    name     |    Date    |      start       |  tstart  | 1st_stamp |
+-------------+------------+------------------+----------+-----------+
| Person N. 1 | 13/08/2020 | 13/08/2020 07:00 | 07:00:00 | y         |
| Person N. 1 | 13/08/2020 | 13/08/2020 13:10 | 13:10:00 | n         |
| Person N. 2 | 13/08/2020 | 13/08/2020 10:00 | 10:00:00 | y         |
| Person N. 2 | 13/08/2020 | 13/08/2020 13:46 | 13:46:00 | n         |
| Person N. 3 | 13/08/2020 | 13/08/2020 09:00 | 09:00:00 | y         |
| Person N. 4 | 13/08/2020 | 13/08/2020 08:00 | 08:00:00 | y         |
| Person N. 4 | 13/08/2020 | 13/08/2020 13:04 | 13:04:00 | n         |
| Person N. 4 | 13/08/2020 | NaT              | NaT      | n         |
| Person N. 5 | 13/08/2020 | 13/08/2020 10:00 | 10:00:00 | y         |
| Person N. 6 | 13/08/2020 | 13/08/2020 07:00 | 07:00:00 | y         |
| Person N. 6 | 13/08/2020 | 13/08/2020 13:29 | 13:29:00 | n         |
| Person N. 7 | 13/08/2020 | 13/08/2020 08:00 | 08:00:00 | y         |
| Person N. 7 | 13/08/2020 | 13/08/2020 14:01 | 14:01:00 | n         |
| Person N. 7 | 13/08/2020 | 13/08/2020 16:00 | 16:00:00 | n         |
+-------------+------------+------------------+----------+-----------+
F. Arci
  • 3
  • 1

2 Answers2

0

If I understood correctly, you want to create 1st_stamp column right?

To create 1st_stamp column, here is one way to approach:

# 1. Convert to datetime if it isn't already
df['start'] = pd.to_datetime(df['start'])

# 2. Partition data by name and rank them based on start datetime 
df['order'] = df.groupby('name')['start'].rank(method='min')

# 3. Create a variable to indicate if it's the earliest instance
df['1st_stamp'] = np.where(df['order']==1, 'y', 'n')
df

2nd step is copied from this stackoverflow answer.

This will create the order column - if you don't need it, you can just delete it with del(df['order']).

0

Ensure the column is a datetime;

df['start'] = pd.to_datetime(df['start'])

Return the first time you can do something like;

df.groupby(['name', 'Date', 'tstart']).first()

Or first time and the count of entries;

grouped = df.groupby(['name', 'Date', 'tstart']).agg({'tstart': ['min', 'count']})

Pratik Thanki
  • 246
  • 2
  • 4