0

I have a dataframe like below:

  Date        Group   Value   Duration
2018-01-01      A      20       30
2018-02-01      A      10       60
2018-01-01      B      15      180
2018-02-01      B      30      210
2018-03-01      B      25      238 
2018-01-01      C      10      235

In the above df, we see that for group A last date is 2018-02-01. And I want to insert a row with Date as 2018-03-01 and duplicate the values of the previous row. So in other words, for each group, I want equal number of row count by inserting a duplicating row till the Date reaches 2018-03-01.

My resultant Dataframe would look like

  Date        Group   Value   Duration
2018-01-01      A      20       30
2018-02-01      A      10       60
2018-03-01      A      10       60   <----New Row 
2018-01-01      B      15      180
2018-02-01      B      30      210
2018-03-01      B      25      238  
2018-01-01      C      10      235
2018-02-01      C      10      235  <----New Row
2018-03-01      C      10      235  <----New Row

N.B.: We can have 0 values for new row instead of duplication from last existing row.

I am trying to achieve the above by the following codes but not able to progress further:

max_d = df.loc[df.groupby(['Group'])['Date'].idxmax()]
for k,v in df.groupby(['Group']):
   if k['Date'].max() < max_d['Date'].max():
      last_row = 0 <----or k.iloc[-1]
      last_row['Date'] = pd.DateOffset(month=1)
      k.append(last_row)

But I am getting error as:

  Traceback (most recent call last):
File "<ipython-input-6-44d0a094516e>", line 2, in <module>
  if k['Date'].max() < max_d['Date'].max():
TypeError: 'int' object is not subscriptable

Am I missing out anything in the above code? Any clue will be appreciated.

pythondumb
  • 1,187
  • 1
  • 15
  • 30

1 Answers1

1

Use pd.MultiIndex to reindex your df:

df["Date"] = pd.to_datetime(df["Date"])

s = pd.MultiIndex.from_product([df["Date"].unique(),df["Group"].unique()],names=["Date","Group"])

print (df.set_index(["Date","Group"]).reindex(s)
       .reset_index()
       .sort_values(["Group","Date"])
       .ffill())

        Date Group  Value  Duration
0 2018-01-01     A   20.0      30.0
3 2018-02-01     A   10.0      60.0
6 2018-03-01     A   10.0      60.0
1 2018-01-01     B   15.0     180.0
4 2018-02-01     B   30.0     210.0
7 2018-03-01     B   25.0     238.0
2 2018-01-01     C   10.0     235.0
5 2018-02-01     C   10.0     235.0
8 2018-03-01     C   10.0     235.0
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
  • Thanks. Just a quick thought, if I use `ffill(0)`, then it would fill the new rows with zeros, right? Besides, can you suggest how to improve my code? – pythondumb Jun 01 '20 at 02:55
  • `ffill` is forward fill. If you wanted to fill with 0, you should use `fillna(0)` instead or simply do `fill_value=0` during `reindex` and skip the `ffill`. And it is best to avoid looping when you use `pandas`. – Henry Yik Jun 01 '20 at 02:59
  • The code above is generating a blank `df`. i.e. only `Date` and `Group` is being populated...rests all are `nan'. – pythondumb Jun 01 '20 at 05:34
  • Using your provided sample data and the methods above, it correctly generates all columns with proper values. – Henry Yik Jun 01 '20 at 05:36