4

here's a sample of my dataset

side     |  serial_number   |   inspector  |   date_1        |    date_2
top      |       10         |   Paul       |   4/1/18 13:21  |    4/1/18 14:22
bot      |       10         |   Jack       |   4/1/18 13:01  |    4/1/18 14:22
bot      |       11         |   Jack       |   4/1/18 14:01  |    4/1/18 14:53
top      |       11         |   Paul       |   4/1/18 14:25  |    4/1/18 14:53
top      |       12         |   Henry      |   4/1/18 14:25  |    4/1/18 14:58

For each unique tuple (serial_number, date_2), I want to keep the row where date_1 is minimum, and keep every column, so that eventually my dataset looks like this:

side     |  serial_number   |   inspector  |   date_1        |    date_2
bot      |       10         |   Jack       |   4/1/18 13:01  |    4/1/18 14:22
bot      |       11         |   Jack       |   4/1/18 14:01  |    4/1/18 14:53
top      |       12         |   Henry      |   4/1/18 14:25  |    4/1/18 14:58

To do so, my current code looks like this:

import pandas as pd

df = pd.read_csv("data.csv")  # load the data into a pandas DataFrame
df_sorted = df.groupby(['serial_number', 'date_2'], sort=False)['date_1'].min()
df_sorted.to_csv("data_sorted.csv")

So I end up with the right rows, but the columns I'm not grouping by are missing. Here's the resulting dataset:

 serial_number   |     date_1        |    date_2
      10         |     4/1/18 13:01  |    4/1/18 14:22
      11         |     4/1/18 14:01  |    4/1/18 14:53
      12         |     4/1/18 14:25  |    4/1/18 14:58

How do I keep all the columns? Thank you.

Jdoe
  • Possible duplicate of [Pandas group by on one column with max date on another column python](https://stackoverflow.com/questions/48754049/pandas-group-by-on-one-column-with-max-date-on-another-column-python) – 3novak Feb 13 '18 at 15:24
  • @3novak it does look similar but I'm not able to apply it to my problem. How would you do it ? – Jdoe Feb 13 '18 at 15:28
  • The solution with sorting and dropping duplicates is directly applicable to your use case. The idxmin approach is another option. Hopefully one of these is a suitable solution for you. – 3novak Feb 13 '18 at 15:35

2 Answers

9

Instead of calling min after your groupby, which returns only the minimum value for each group, use idxmin, which returns the index label where the minimum occurs in each group:

df.groupby(['serial_number','date_2'])['date_1'].idxmin()

# serial_number  date_2             
# 10             2018-04-01 14:22:00    1
# 11             2018-04-01 14:53:00    2
# 12             2018-04-01 14:58:00    4

You can then use these labels with loc to select the complete rows of your dataframe where each group's minimum occurs (idxmin returns index labels, so loc is the right indexer in general; iloc only happens to work here because the default RangeIndex matches the positions):

df.loc[df.groupby(['serial_number','date_2'])['date_1'].idxmin()]

#   side        serial_number inspector                    date_1  \
# 1  bot                   10     Jack        2018-04-01 13:01:00   
# 2  bot                   11     Jack        2018-04-01 14:01:00   
# 4  top                   12     Henry       2018-04-01 14:25:00   
# 
#                date_2  
# 1 2018-04-01 14:22:00  
# 2 2018-04-01 14:53:00  
# 4 2018-04-01 14:58:00  
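As a self-contained sketch of the approach (the sample data from the question is inlined here instead of reading `data.csv`, and the dates are parsed up front so the minimum is computed chronologically rather than as strings):

```python
import io
import pandas as pd

# Sample data from the question
csv = io.StringIO("""side,serial_number,inspector,date_1,date_2
top,10,Paul,4/1/18 13:21,4/1/18 14:22
bot,10,Jack,4/1/18 13:01,4/1/18 14:22
bot,11,Jack,4/1/18 14:01,4/1/18 14:53
top,11,Paul,4/1/18 14:25,4/1/18 14:53
top,12,Henry,4/1/18 14:25,4/1/18 14:58
""")
df = pd.read_csv(csv, parse_dates=['date_1', 'date_2'])

# idxmin returns the index label of each group's minimum date_1;
# loc then pulls the complete rows, all columns included
idx = df.groupby(['serial_number', 'date_2'])['date_1'].idxmin()
result = df.loc[idx]
print(result)
```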
cmaher
  • I chose your solution. I should add that I first had to parse the dates into a datetime format (they were strings before), like this: `dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M')` then `df = pd.read_csv("data.csv", parse_dates=['date_1'], date_parser=dateparse)` – Jdoe Feb 13 '18 at 15:52
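A side note on the comment above: `pd.datetime` has been removed in recent pandas versions, so a custom `date_parser` lambda like that no longer works. A hedged modern equivalent (the sample strings below are illustrative) is to call `pd.to_datetime` with an explicit format:

```python
import pandas as pd

# Parse date strings with an explicit format instead of a strptime lambda
s = pd.Series(['2018-04-01 13:01', '2018-04-01 14:25'])
parsed = pd.to_datetime(s, format='%Y-%m-%d %H:%M')
print(parsed)
```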
2

I think what you want can be achieved in two steps:

  1. Sort the data earliest-to-latest by date_1
  2. Drop duplicates with respect to the unique (serial_number, date_2) tuples, keeping the first occurrence

The following is one solution:

df = pd.read_csv("data.csv")
df_sorted = df.sort_values('date_1', ascending=True)
df_deduped = df_sorted.drop_duplicates(subset=['serial_number', 'date_2'], keep='first')
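The two steps above can be sketched end-to-end like this (sample data from the question inlined instead of `data.csv`; note that `drop_duplicates` returns a new frame, so its result has to be kept):

```python
import io
import pandas as pd

# Sample data from the question
csv = io.StringIO("""side,serial_number,inspector,date_1,date_2
top,10,Paul,4/1/18 13:21,4/1/18 14:22
bot,10,Jack,4/1/18 13:01,4/1/18 14:22
bot,11,Jack,4/1/18 14:01,4/1/18 14:53
top,11,Paul,4/1/18 14:25,4/1/18 14:53
top,12,Henry,4/1/18 14:25,4/1/18 14:58
""")
df = pd.read_csv(csv, parse_dates=['date_1', 'date_2'])

# Step 1: sort so the earliest date_1 comes first
df_sorted = df.sort_values('date_1')
# Step 2: keep the first (earliest) row for each (serial_number, date_2) pair
result = df_sorted.drop_duplicates(subset=['serial_number', 'date_2'], keep='first')
print(result)
```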

Cheers!