3

I am trying to merge two dataframe using a FK and between two dates then save the output in a new dataframe.

Consider the below example:

# first_df 
FK    date          value1   value2 ... (more columns)
1     2019-01-01    50       50
1     2019-01-02    40       80
1     2019-01-03    80       20
1     2019-01-04    18       44
1     2019-01-05    120      50
1     2019-01-06    80       0
1     2019-01-10    60       65
1     2019-01-15    25       44
1     2019-01-25    20       20
2     2019-01-01    50       40
2     2019-01-02    80       45
...............................


# second_df
FK    date          percentage
1     2019-01-01    50
1     2019-01-05    80
1     2019-01-10    40
1     2019-01-15    60
1     2019-01-25    90
2     2019-01-01    48
2     2019-01-08    40
2     2019-01-20    48
......................


# output_df
FK    date          value1            value2 ... (more columns)
1     2019-01-01    50% of 50 = 25    50% of 50 = 25
1     2019-01-02    50% of 40 = 20    50% of 80 = 40
1     2019-01-03    50% of 80 = 40    50% of 20 = 10
1     2019-01-04    50% of 18 = 9     50% of 44 = 22
1     2019-01-05    80% of 120 = 96   80% of 50 = 40
1     2019-01-06    80% of 80 = 64    80% of 0 = 0
1     2019-01-10    40% of 60 = 24    40% of 65 = 26
1     2019-01-15    60% of 25 = 15    60% of 44 = 26.4
1     2019-01-25    90% of 20 = 18    90% of 20 = 18
2     2019-01-01    48% of 50 = 24    48% of 40 = 19.2
2     2019-01-02    48% of 80 = 38.4  48% of 45 = 21.6

Pay attention to the 1st record with FK 2 as you can see, the index is my FK.

The percentage is applied to all records with the same FK where my date is: second_df.date <= first_df.date < and second_df.date_NEXT

For instance, between 2019-01-01 and 2019-01-04 I apply the percentage 50 (coming from second_df)

I have been searching for a moment now for a clean and readable implementation... I know I can set the index on my fk and use apply on my df by specifying the "value1" column. But how would you handle this if there is something like more than 5 columns ?

Hope you will understand that I have little experience with pandas


EDIT1

data1 = {'FK':[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
             'date':['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-10', '2019-01-15', '2019-01-25', '2019-01-01', '2019-01-02'],
             'value1':[50, 40, 80, 18, 120, 80, 60, 25, 20, 50, 80]}
data2 = {'FK': [1, 1, 1, 1, 1, 2, 2],
             'date': ['2019-01-01', '2019-01-05', '2019-01-10', '2019-01-15', '2019-01-25', '2019-01-01',
                      '2019-01-08'],
             'percentage': [50, 80, 40, 60, 90, 48, 40]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged_df = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'), by='FK', on='date').sort_values('FK')

EXCEPTION happening using the above dataset: Function call with ambiguous argument types

In case you have this exception, this is because your column "on" (in my case FK) is not considered as a date, but as a string.

According to panda doc: [...]Furthermore this must be a numeric column, such as datetimelike, integer, or float.

WilliamW
  • 438
  • 7
  • 18

1 Answers1

3

In your case we use merge_asof

df=pd.merge_asof(df1.sort_values('date'),df2.sort_values('date'),by='FK',on='date').sort_values('FK')

Then we have percentage and values in same df, we can do multiple

df[['value1','value2']]=df[['value1','value2']].mul(df.percentage/100,0)
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 1
    Thank you for your very fast and accurate answer. I am getting a ```Function call with ambiguous argument types``` exception when calling the merge_asof on the dataset (check my edit1). – WilliamW Dec 11 '19 at 15:54
  • 1
    Ok in my case, having a FK which is in the format 'YYYY-MM-DD' is not considered as a date but as a string... (which according to the doc, the column "on" MUST be a date, int or float). so using this: https://stackoverflow.com/questions/26763344/convert-pandas-column-to-datetime helped to convert to a datetime and fixed my issue – WilliamW Dec 11 '19 at 16:08
  • Do you know how I can merge data only if the month is the same ? I saw there is a "tolerance" option but this one seems to take only an interval... (which is not known in advance) – WilliamW Dec 16 '19 at 14:03
  • 1
    @WilliamW create Month Year col, by `df1['Y-m']=df1.date.str[:7],df2['Y-m']=df2.date.str[:7]`, then pass to `by=['FK','Y-m']` – BENY Dec 16 '19 at 14:05