How can I merge based on most granular data available?

Question

new here

I have the following issue. I have 4 df-s.

main_df - has 5 columns, interested in first 3 (Lease, Opportunity, Archetype) and 300000 rows
lease_df - has Lease as first column, then has 10 columns corresponding to hours and 15 rows
opportunity_df - has Opportunity as first column, then has 10 columns corresponding to hours and 160 rows
archetype_df - has Archetype as first column, then has 10 columns corresponding to hours and 20 rows

For each row in main_df, I want to find the most granular available hourly data from the 3 other tables and merge to main_df. It's important to note that for any given row in tables 2-4 all 10 columns can either all be null or all non-null. archetype_df will not have any nulls, so would be last resort to be merged.

So I want first merge the lease_df on Lease, then iterate over all the rows and see if the 10 columns are null or not. If null, then for that row merge opportunity_df on Opportunity and look if the 10 columns are null, if yes then merge archetype_df on Archetype.

The code below seems to be working but takes >3-4 hours to run since I have 300,000 rows which might increase in the future. Can you help me find an optimized version that would run much faster?

lease_df

Lease	1	2	3	4	5	6	7	8	9	10
lease1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...

opportunity_df

Opportunity	1	2	3	4	5	6	7	8	9	10
opp1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...

archetype_df

Archetype	1	2	3	4	5	6	7	8	9	10
arch1	1	1.4	2	2	2	2	2	2	1.4	1
...


main_df_3 = pd.DataFrame()

main_df_1 = pd.merge(main_df, lease_df , on='Lease', how='left')
for i in range(len(main_df_1 )):
    if main_df_1 .iloc[i, 5:].isnull().all():
        main_df_2 = main_df_1 .iloc[i:i+1, :5]
    main_df_2 = pd.merge(main_df_2 , opportunity_df, on='Opportunity', how='left')
    if main_df_2 .iloc[0, 5:].isnull().all():
        main_df_2 = main_df_2 .iloc[0:1, :5]
    main_df_2 = pd.merge(main_df_2 , archetype_df, on='Archetype', how='left')
    main_df_3 = pd.concat([main_df_3 , main_df_2 ])

Please **always** add a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) (also look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)). Since you seem to have a lot of columns just provide the code to produce a sample or provide a sample with less columns. — Timus, Apr 12 '23 at 11:52

How can I merge based on most granular data available?

0 Answers0