
This is a slow solution for what I am hoping to achieve. The problem is performance. Is there a more idiomatic pandas way to achieve this without the user-defined function? The goal is to keep all rows that share the first timestamp occurring in each group.

def get_first_id_time(df):
    # First timestamp in the group (positional, since the group keeps its original index)
    first_time = df['datetime'].iloc[0]
    # Keep every row that shares that timestamp
    return df.loc[df['datetime'] == first_time]

data = data.groupby('id').apply(get_first_id_time)

EDIT: Note that, for each group, there are many rows with datetime == first_time.

dimab0
  • use `drop_duplicates(keep='first')` – YusufUMS Apr 12 '19 at 14:27
    You could, in this instance, sort by `id` and `datetime`, then `drop_duplicates` on `id` with param `keep='first'` – Alex Apr 12 '19 at 14:27
  • It seems like you simply need `data.groupby('id').head(1)` – yatu Apr 12 '19 at 14:29
    Maybe it is not clear enough - there are many rows that are equal to "first_time". So keeping just the first row does not work. – dimab0 Apr 12 '19 at 14:29
  • Perhaps you should include a [mcve] – Alex Apr 12 '19 at 14:30
    Then this is `transform` + mask with a Boolean Series: `data[data.groupby('id').datetime.transform(min) == data.datetime]`. There's a dup somewhere – ALollz Apr 12 '19 at 14:32
  • https://stackoverflow.com/questions/15705630/python-getting-the-row-which-has-the-max-value-in-groups-using-groupby That's for max, but just change to min (or `first`) – ALollz Apr 12 '19 at 14:33
  • The comment by ALollz seems most clear and very efficient. If you make this an "answer" to the question I will accept it. Thank you for the help. – dimab0 Apr 12 '19 at 16:23
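The transform-and-mask approach from the comments can be sketched on a toy frame (the `id` and `datetime` values below are assumed purely for illustration):

```python
import pandas as pd

# Toy data: id 1 has two rows tied on its earliest timestamp,
# so any "keep only the first row" approach would drop one of them.
data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'datetime': pd.to_datetime([
        '2019-04-12 10:00', '2019-04-12 10:00', '2019-04-12 11:00',
        '2019-04-12 09:30', '2019-04-12 09:45',
    ]),
})

# transform('min') broadcasts each group's minimum back to every row,
# so the Boolean mask keeps all rows tied for the earliest timestamp.
first_rows = data[data.groupby('id')['datetime'].transform('min') == data['datetime']]
```

Here `first_rows` keeps both tied rows for id 1 plus the single earliest row for id 2, without any Python-level function call per group.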

1 Answer


Can you just get the min datetime and merge?

min_datetime = data.groupby('id')['datetime'].min().reset_index()

data = data.merge(min_datetime, how='inner', on=['id', 'datetime'])

Edit:

Since there are many rows that have the same first_datetime, just merge on both datetime and id.
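A minimal sketch of the merge approach on toy data (the frame and its `value` column are assumed for illustration):

```python
import pandas as pd

# Toy data: id 1 has two rows tied on its earliest timestamp.
data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'datetime': pd.to_datetime([
        '2019-04-12 10:00', '2019-04-12 10:00', '2019-04-12 11:00',
        '2019-04-12 09:30', '2019-04-12 09:45',
    ]),
    'value': [10, 20, 30, 40, 50],
})

# One row per id, holding that id's earliest timestamp.
min_datetime = data.groupby('id')['datetime'].min().reset_index()

# Inner merge on both keys keeps exactly the rows at each group's
# first timestamp, including ties.
out = data.merge(min_datetime, how='inner', on=['id', 'datetime'])
```

Merging on `id` alone would match every row of a group against its minimum-timestamp row and keep them all (with `datetime_x`/`datetime_y` columns), which is why both keys are needed.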

Sam