0

I have a dataframe that looks something like this:

d={'business':['FX','FX','IR','IR'],\
'date':(['01/01/2018','05/01/2018','01/01/2018','05/01/2018']),\
'amt':[1,5,101,105]}
df=pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'],format='%d/%m/%Y')
df

Is there a function that will expand the dataframe above to look something like:

 d_out={'business':['FX','FX','FX','FX','FX','IR','IR','IR','IR','IR'],\
'date':(['01/01/2018','02/01/2018','03/01/2018','04/01/2018','05/01/2018',\
        '01/01/2018','02/01/2018','03/01/2018','04/01/2018','05/01/2018']),\
'amt':[1,2,3,4,5,101,102,103,104,105]}

d_out=pd.DataFrame(data=d_out)

d_out

I am trying to insert rows based on the number of days between two dates and populate the amt field based on some kind of simple average.

Just checking to see the most efficient read easy way of doing the above !

Thanks,

Number Logic
  • 852
  • 1
  • 9
  • 19

3 Answers3

1

I think that you'll be better off using the date column as a time-index, and using the amt of the FX/IR businesses as two columns (called, for example, IR_amt and FX_amt).
Then, you can use .interpolate on the dataframe and immediately obtain the solution. No additional functions defined, etc.
Code example:

import numpy as np

for business in set(df['business'].values):
    df['{}_amt'.format(business)] = df.apply(lambda row: row['amt'] if row['business']==business else np.nan, axis=1)

df = df.drop(['business','amt'],axis=1).groupby('date').mean()
df = df.resample('1D').interpolate()
Itamar Mushkin
  • 2,803
  • 2
  • 16
  • 32
  • 1
    Sure, added in this edit. Run it on your side, see if it works. I guess the first loop could've been avoided, but this is still cleaner than the alternatives, by utilizing the power of `resample` (to one day) and `interpolate`. – Itamar Mushkin May 19 '19 at 11:37
  • I avoided the first loop by using a multi index . Thanks for saving me LOADS of time and for an elegant solution ! – Number Logic May 20 '19 at 17:01
0

agg the df back to list mode , then look at unnesting

x=df.groupby('business').agg({'amt':lambda x : list(range(x.min(),x.max()+1)),'date':lambda x : list(pd.date_range(x.min(),x.max()))})
yourdf=unnesting(x,['amt','date'])
yourdf#yourdf=yourdf.reset_index)
Out[108]: 
          amt       date
business                
FX          1 2018-01-01
FX          2 2018-01-02
FX          3 2018-01-03
FX          4 2018-01-04
FX          5 2018-01-05
IR        101 2018-01-01
IR        102 2018-01-02
IR        103 2018-01-03
IR        104 2018-01-04
IR        105 2018-01-05

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I get a NaT when I pass:d={'business':['FX','FX','IR','IR'],\ 'date':(['01/01/2018','05/01/2018','01/01/2018','05/01/2018']),'amt':[1,5,101,110]} df=pd.DataFrame(data=d) df['date'] = pd.to_datetime(df['date'],format='%d/%m/%Y') df – Number Logic May 19 '19 at 00:55
  • @NumberLogic try my code with your sample data , if that work , then you may check what is the difference between your sample data and your real data – BENY May 19 '19 at 00:56
0

There are couple of things you need to take care of :

  1. Create an empty array
  2. Check if there is a gap of more than a day in the 'date' column, if yes then append:

    -To the above array append the new consecutive dates.

    -Add 'business' values, add 'amt' by taking the average value for the consecutive rows in the original data frame

Below is the way I did:

import pandas as pd
import numpy as np
d={'business':['FX','FX','IR','IR'],\
'date':(['01/01/2018','05/01/2018','01/01/2018','05/01/2018']),\
'amt':[1,5,101,105]}
df=pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'],format='%d/%m/%Y')
df_array = []
result_df = df
orig_row=0
new_row=0
for i in range(len(df)):
    df_array.append(df.values[orig_row])
    if orig_row <len(df)-1:
        if ((df.date[orig_row+1] - df.date[orig_row]).days > 1):
            amt_avg = (df.amt[orig_row]+df.amt[orig_row+1])/2
            for i in range(((df.date[orig_row+1] - df.date[orig_row]).days)-1):
                df_array.append([df.business[orig_row],df.date[orig_row]+timedelta(days=i+1), amt_avg])
    orig_row+=1
result_df = pd.DataFrame(df_array,columns=['business','date','amt'])

Output:

  business       date    amt
0       FX 2018-01-01    1.0
1       FX 2018-01-02    3.0
2       FX 2018-01-03    3.0
3       FX 2018-01-04    3.0
4       FX 2018-01-05    5.0
5       IR 2018-01-01  101.0
6       IR 2018-01-02  103.0
7       IR 2018-01-03  103.0
8       IR 2018-01-04  103.0
9       IR 2018-01-05  105.0
prabhudotpy
  • 31
  • 1
  • 7
  • if you want to take the average for the 'business' column you would need to do the categorical encoding(only if there are too many categories in you dataset) , otherwise this answer should d othe trick. – prabhudotpy May 19 '19 at 10:23