Pandas - Compare 2 dataframes to look for first instance per day where value is a multiple from first dataframe

Question

Good Morning: I just started learning pandas and i have been searching for a comparison option but I think I don't know what terms to look for

DF1
Day|Time|Dept|Val
01/01/2020|8:42|PWD|10
01/02/2020|8:42|PWD|60
01/03/2020|8:42|PWD|140
01/04/2020|8:42|PWD|200

DF2 [To compare against]

    Day Time    Dept    Val
    01/01/2020  8:42    PWD 10
    01/01/2020  8:46    PWD 12
    01/01/2020  9:21    PWD 21
    01/01/2020  10:15   PWD 18
    01/01/2020  11:30   PWD 26
    01/01/2020  12:25   PWD 15
    01/02/2020  8:42    PWD 60
    01/02/2020  8:46    PWD 55
    01/02/2020  9:21    PWD 19
    01/02/2020  10:15   PWD 102
    01/02/2020  11:30   PWD 111
    01/02/2020  12:25   PWD 120
    01/03/2020  8:42    PWD 140
    01/03/2020  8:46    PWD 90
    01/03/2020  9:21    PWD 50
    01/03/2020  10:15   PWD 220
    01/03/2020  11:30   PWD 290
    01/03/2020  12:25   PWD 210
    01/04/2020  8:42    PWD 200
    01/04/2020  8:46    PWD 90
    01/04/2020  9:21    PWD 50
    01/04/2020  10:15   PWD 220
    01/04/2020  11:30   PWD 250
    01/04/2020  12:25   PWD 399

What I would like your help with is to find the first instance per day on DF2 based on DF1 Value that is at least double

So for Jan 1, since DF1 value is 10, I have to find the first instance of value greater than or equal to 20 for Just Jan 1 in DF2 and so on.

So output should be either a new dataframe or just modify DF1

   Day         Time  Dept  Val  Found Found_time
0  01/01/2020  8:42  PWD   10   True  9:21
1  01/02/2020  8:42  PWD   60   True  10:15
2  01/03/2020  8:42  PWD  140   True  11:30
3  01/04/2020  8:42  PWD  200  False  NULL

To get more background on what I have done so far, my original question which directed me towards Pandas Row sorting and selection logic in Python on Sqlite db

Thanks all for your help. It's been twenty four years since I have coded and I am amazed at how much languages have changed from fortran and cobol.

score 3 · Accepted Answer · edited Jan 22 '22 at 08:59

3

You only need to keep the maximum value in each Day, then merge on Day and compare the corresponding Value's:

df1 = (df1.merge(df2.sort_values('Val')
                    .drop_duplicates('Day',keep='last')
                    [['Day','Val']],
                 on='Day',
                 how='left',
                 suffixes=['','_r'])
          .assign(Found=lambda x: x['Val']*2 <= x['Val_r'])
          .drop('Val_r', axis=1)
      )

Output:

          Day  Time Dept  Val  Found
0  01/01/2020  8:42  PWD   10   True
1  01/02/2020  8:42  PWD   60   True
2  01/03/2020  8:42  PWD  140   True
3  01/04/2020  8:42  PWD  200  False

Edited to match the updated output:

new_df = (df1.merge(df2, on=['Day','Dept'],
                    how='left',
                    suffixes=['','_r'])
             .assign(Found=lambda x: x['Val_r']>=2*x['Val'],
                     Found_time=lambda x: x.Time_r.where(x.Found))
             .groupby(['Day','Time','Dept'])
             .agg({'Found':'any',
                   'Found_time':'first'})
             .reset_index()
         )

Output:

          Day  Time Dept  Found Found_time
0  01/01/2020  8:42  PWD   True       9:21
1  01/02/2020  8:42  PWD   True      12:25
2  01/03/2020  8:42  PWD   True      11:30
3  01/04/2020  8:42  PWD  False        NaN

edited Jan 22 '22 at 08:59

halfer

19,824
17
99
186

answered May 08 '20 at 17:44

Quang Hoang

146,074
10
56
74

Quang. Thanks. I thought about that solution. The issue is that we need to distribute by time when orders are placed. So ultimately we will need to get the first instance of the time the order is placed eventually. We ultimately would like to have some kind of statistical analysis of what time during the day we have to consume at least twice the amt from the start of the day. – toprockupbeat17 May 08 '20 at 17:59
Now that's not given in the expected output does it :-). Anyway, you could done so with a query after merge. – Quang Hoang May 08 '20 at 18:00
Haha, my bad. Maybe I should edit the final dataframe and add a time field as well, so it can be clearer. – toprockupbeat17 May 08 '20 at 18:03
wow man. Thanks a lot.. let me take a look. That's exactly what I am looking for. – toprockupbeat17 May 08 '20 at 18:36
Quang: I tried that out and it worked great. Thanks again. The Lambda seems to be super powerful. – toprockupbeat17 May 08 '20 at 22:33
Quang, Had a question regarding the process. I am trying to understand the logic here. What I envisioned many many lines of code, you helped me with just a few lines, so Thanks again. it works great So I got the merge part with the left join suffix helps create the new column names Lambda X will help us find the value that is twice the value found in val_r Here I am confused, how does the program know to only find within the particular day? – toprockupbeat17 May 12 '20 at 16:40
Because we merged on `Day`? – Quang Hoang May 12 '20 at 16:46
Cool. Thanks for the reply and for your kindness. I am still thinking from a excel perspective In my mind, just going through the code without knowing the details completely. – toprockupbeat17 May 12 '20 at 16:53

Pandas - Compare 2 dataframes to look for first instance per day where value is a multiple from first dataframe

1 Answers1