1

Please see code below.

Overview

I have a dataframe 'df_master' that contains four columns: Date; the name of a specific df from which data should be pulled; measurement on specified Date; and measurement one year after specified Date.

What I am trying to do is write code so that, for each row in df_master, the code pulls measurement data into df_master from one of a large number of other dfs. Each row will have a different df from which it is pulling measurement data. I want to populate two new columns in df_master. One would be the measurement data pulled from the specified df for the date listed in that row in df_master, and the other would be measurement data on the date that is one year in the future.

Example

For instance, consider the code below. The first row in df_master is '2016-01-01' and 'df_B'. This means that the first row of df_master should populate with a value of 250 for 'Measurement_Today' and 265 for 'Measurement_One_Year_in_Future'.

The next row of df_master would need to pull the Measurement data from df_C, etc. In other words, each row of df_master would specify the df from which the Measurement data would come from, and that df differs by row.

Assume that there are thousands of dfs from which df_master must pull data, and combining all of those thousands of dfs into one df is not possible.

I have tried to write a for loop that would pull this data in without success. I also tried to use .iloc to pull in the data for one year in the future, but that is not working, either.

Any help you can provide would be greatly appreciated. Thank you.

Code

# Import dependencies
import pandas as pd
import numpy as np

# Create 'df_A', 'df_B', 'df_C' and 'df_D', which contain measurement data on specific dates.
df_A = pd.DataFrame(np.array([['2016-01-01', 150], ['2017-01-01', 145], 
                              ['2018-01-01', 163], ['2019-01-01', 170],
                              ['2020-01-01', 198], ['2021-01-01', 189],]),
                   columns=['Date', 'Measurement'])
df_A['Date'] = pd.to_datetime(df_A['Date'])

df_B = pd.DataFrame(np.array([['2016-01-01', 250], ['2017-01-01', 265], 
                              ['2018-01-01', 221], ['2019-01-01', 285],
                              ['2020-01-01', 298], ['2021-01-01', 289],]),
                   columns=['Date', 'Measurement'])
df_B['Date'] = pd.to_datetime(df_B['Date'])

df_C = pd.DataFrame(np.array([['2016-01-01', 350], ['2017-01-01', 367], 
                              ['2018-01-01', 392], ['2019-01-01', 370],
                              ['2020-01-01', 398], ['2021-01-01', 389],]),
                   columns=['Date', 'Measurement'])
df_C['Date'] = pd.to_datetime(df_C['Date'])

df_D = pd.DataFrame(np.array([['2016-01-01', 450], ['2017-01-01', 454], 
                              ['2018-01-01', 413], ['2019-01-01', 480],
                              ['2020-01-01', 498], ['2021-01-01', 489],]),
                   columns=['Date', 'Measurement'])
df_D['Date'] = pd.to_datetime(df_D['Date'])


# Create df_master 
df_master = pd.DataFrame(np.array([['2016-01-01', 'df_B','','' ], ['2017-01-01', 'df_C','',''  ], 
                                   ['2018-01-01', 'df_B','','' ], ['2019-01-01', 'df_A','','' ],
                                   ['2018-01-01', 'df_A','','' ], ['2019-01-01', 'df_D','','' ],]),
                   columns=['Date', 'df_to_pull_measurement_from', 'Measurement_Today', 
                            'Measurement_Next_Year'])
df_master['Date'] = pd.to_datetime(df_master['Date'])


# Create list of dfs from df_master['df_to_pull_measurement_from'].
list_of_dfs = df_master['df_to_pull_measurement_from']

### THIS DOES NOT WORK ###
# Add columns to df_master for measurement on given date and measurement one year into future.
for row in list_of_dfs:
    selected_df = row + '[\'Measurement\']'
    df_master['Measurement_Today'] = selected_df
    df_master['Measurement_Next_Year'] = selected_df.iloc[idx +1]
BGG16
  • 474
  • 5
  • 11

1 Answers1

1

I think using globals for string variables is not good idea, better is create dictionary:

#create DatetimeIndex 
df_master = df_master.set_index('Date')

for row in list_of_dfs:

    selected_df = globals()[row].set_index('Date')['Measurement']
    m1 = df_master['df_to_pull_measurement_from'].eq(row)
    nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))

    df_master.loc[m1, 'Measurement_Today'] = selected_df
    df_master.loc[m1, 'Measurement_Next_Year'] = nexty
    

#create DatetimeIndex 
df_master = df_master.set_index('Date')

#dict of DataFrames
dfs = {'df_A':df_A, 'df_B':df_B,'df_C':df_C, 'df_D':df_D}
#create DatetimeIndex in each DataFrame
dfs1 = {k: v.set_index('Date') for k, v in dfs.items()}

for row in list_of_dfs:

    selected_df = dfs1[row]['Measurement']
    m1 = df_master['df_to_pull_measurement_from'].eq(row)
    #for next years is subtract one year from DatetimeIndex
    nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))

    df_master.loc[m1, 'Measurement_Today'] = selected_df
    df_master.loc[m1, 'Measurement_Next_Year'] = nexty

print (df_master)

           df_to_pull_measurement_from Measurement_Today Measurement_Next_Year
Date                                                                          
2016-01-01                        df_B               250                   265
2017-01-01                        df_C               367                   392
2018-01-01                        df_B               221                   285
2019-01-01                        df_A               170                   198
2018-01-01                        df_A               163                   170
2019-01-01                        df_D               480                   498
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for the very quick and helpful reply. When I run the code above using the dictionary, each time through the loop all rows of 'Measurement_Today' are populated with whichever df is being used in that iteration. In other words, it is not using different dfs for each row. In addition, the 'Measurement_Next_Year' column is populated with the same number in every row, so it does not appear to be picking up the correct data. Any ideas how to tweak your code? Thanks again. – BGG16 Feb 17 '22 at 06:16
  • @GbG - Can you add expected ouput after run loop? – jezrael Feb 17 '22 at 06:18
  • @GbG - answer was edited. – jezrael Feb 17 '22 at 06:42
  • Thank you again for the prompt reply. For some reason, when I run the code myself, I get values of NaN in df_master. The code I ran was all of my code above down to the comment '### THIS DOES NOT WORK ###' and then copied and pasted your code using the dictionary approach. I am running the code in google colab, so perhaps there is a quirk with colab. But are you able to paste the full code that you ran to get the result you showed above (which is exactly the output I am looking for)? Thanks again - really appreciate it. – BGG16 Feb 17 '22 at 06:54
  • 1
    @GbG - I forget for `df_master = df_master.set_index('Date')` add to answer. Now it should working well – jezrael Feb 17 '22 at 06:55
  • 1
    THANK YOU!! It works perfect now. Greatly appreciate it! – BGG16 Feb 17 '22 at 06:57
  • Is there any way to modify the code to select a 'Measurement_Next_Year' by selecting the row that is X rows below the row used for 'Measurement_Today'? Some of the data is not exactly one year apart, and so I would like to be able to pull the date that is X number of rows below the 'Measurement_Today' row and use that data for 'Measurement_Next_Year' – BGG16 Feb 17 '22 at 18:28
  • 1
    @GbG - I think try `nexty = selected_df.shift(-1)` – jezrael Feb 18 '22 at 05:56