new here
I have the following issue. I have 4 df-s.
- main_df - has 5 columns, interested in first 3 (Lease, Opportunity, Archetype) and 300000 rows
- lease_df - has Lease as first column, then has 10 columns corresponding to hours and 15 rows
- opportunity_df - has Opportunity as first column, then has 10 columns corresponding to hours and 160 rows
- archetype_df - has Archetype as first column, then has 10 columns corresponding to hours and 20 rows
For each row in main_df, I want to find the most granular available hourly data from the 3 other tables and merge to main_df. It's important to note that for any given row in tables 2-4 all 10 columns can either all be null or all non-null. archetype_df will not have any nulls, so would be last resort to be merged.
So I want first merge the lease_df on Lease, then iterate over all the rows and see if the 10 columns are null or not. If null, then for that row merge opportunity_df on Opportunity and look if the 10 columns are null, if yes then merge archetype_df on Archetype.
The code below seems to be working but takes >3-4 hours to run since I have 300,000 rows which might increase in the future. Can you help me find an optimized version that would run much faster?
lease_df
Lease | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
lease1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... |
opportunity_df
Opportunity | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
opp1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... |
archetype_df
Archetype | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
arch1 | 1 | 1.4 | 2 | 2 | 2 | 2 | 2 | 2 | 1.4 | 1 |
... |
main_df_3 = pd.DataFrame()
main_df_1 = pd.merge(main_df, lease_df , on='Lease', how='left')
for i in range(len(main_df_1 )):
if main_df_1 .iloc[i, 5:].isnull().all():
main_df_2 = main_df_1 .iloc[i:i+1, :5]
main_df_2 = pd.merge(main_df_2 , opportunity_df, on='Opportunity', how='left')
if main_df_2 .iloc[0, 5:].isnull().all():
main_df_2 = main_df_2 .iloc[0:1, :5]
main_df_2 = pd.merge(main_df_2 , archetype_df, on='Archetype', how='left')
main_df_3 = pd.concat([main_df_3 , main_df_2 ])