0

I've been reading about how it is best practice to avoid using iterrows to iterate through a pandas DataFrame, but I am not sure how else I can solve my particular problem:

How can I:

  1. Find the "time" of the first instance of the value "c" in one DataFrame, df1, grouped by "num" and sorted by "time"
  2. Then add that "time" into a separate DataFrame, df2, based on "num".

Here is an example of my input DataFrame:

import pandas as pd

df = pd.DataFrame({'num': [2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 
                           8, 8, 8, 8, 9, 9, 9, 9, 9], 
                   'state': ['a', 'b', 'c', 'b', 'a', 'b', 'c', 'b', 'c', 'b', 'c', 'a', 
                             'b', 'c', 'b', 'c', 'b', 'c', 'a', 'b', 'c', 'b', 'c', 'b', 
                             'c', 'b', 'c', 'b', 'c', 'b'],
                   'time': [234, 239, 244, 249, 100, 105, 110, 115, 120, 125, 130, 3, 8, 
                            13, 18, 23, 28, 33, 551, 556, 561, 566, 571, 576, 581, 45, 50, 
                            55, 60, 65]})

Expected output (df2):

num time    
2   244
5   110
7   13
8   561
9   50

Every solution I attempt seems like it would require iterrows to load the "time" into df2.

Arne
  • 9,990
  • 2
  • 18
  • 28
atroy01
  • 13
  • 2
  • Well you don't really need to load anything into `df2`. You can get that from an aggregation of `df1` and then if you need to ensure specific rows always or never appear you would `reindex`. If you provide your data in a format people can easily copy and paste (or runable code) you're more likely to get help. It's just a lot of work to reproduce as is – ALollz May 11 '21 at 22:27
  • Please take a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), we ask that questions include a [mcve] with your sample input and expected output in the text of your question rather than as a picture or link – G. Anderson May 11 '21 at 22:38

2 Answers2

1

You can do it in one line, using df.groupby() with min() as the aggregation function:

df[df.state == 'c'].drop('state', axis=1).groupby('num').aggregate(min)
    time
num     
2   244
5   110
7   13
8   561
9   50
Arne
  • 9,990
  • 2
  • 18
  • 28
0

Its hard to check without re-creating the df but i think this should do it

def first_c(group):
    filtered = group[group['state'] == 'c'].iloc[0]
    return filtered[['num', 'time']]


df2 = df.groupby('num').apply(first_c)
  1. Group by num
  2. Apply function and filter for c, find first integer index with iloc
  3. return num and time
fthomson
  • 773
  • 3
  • 9