2

I have a pandas dataframe df as

Date  cost    NC
20     5   NaN
21     7   NaN
23     9  78.0
25     6  80.0

Now what I need to do is fillup the missing dates and hence fill the column with a value say x only if there is number in the previous row. That is I want the output like

Date  cost    NC
20     5   NaN
21     7   NaN
22     x   NaN
23     9  78.0
24     x   x
25     6  80.0

See Date 22 was missing and on 21 NC was missing, So on 22 cost is assigned to x but NC is assigned to NaN. Now setting the Date column to index and reindexing it to missing values I can get upto here

Date  cost    NC 
20     5.0   NaN
21     7.0   NaN
22     NaN   NaN
23     9.0  78.0
24     NaN   NaN
25     6.0  80.0

But I cant get to the final output. If you think this way it is like ffill() but instead of filling from previous row you have to put x here.

I have another problem. here I have a dataframe df like this

Date type  cost
10    a    30
11    a    30
11    b    25
13    a    27

Here also I have to fill the missing value and make it like this

Date type cost
10    a   30
11    a   30
11    b   25
12    a   30
12    b   25
13    a   27

as you can see there was 2 data row for date 11 so both are copied to 12. I wrote this program for the problem

missing=[12]
for i in missing:
    new_date=i
    i-=1        #go to previous date
    k=df[df["Date"] == i].index.tolist()[-1]+1 #index where to be filled
    data=pd.DataFrame(df[df["Date"] == i].values,columns=df.columns)
    data["Date"]=new_date
    df=pd.concat([df.iloc[:k],data,df.iloc[k:]]).reset_index(drop=True)

Now for a large data set the above program takes a lot of time as has to find index and concat 3 dataframe each time. Is there any better and efficient way to solve this problem?

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
Eular
  • 1,707
  • 4
  • 26
  • 50
  • This is two separate questions. Please ask two separate questions! – Andy Hayden Jun 14 '16 at 20:42
  • Why do you want x rather than NaN? – Andy Hayden Jun 14 '16 at 20:43
  • because in my actual data I have some missing values and also some half filled column and I can not fill those places and the next where i am filling for missing dates. I exactly need what I posted – Eular Jun 14 '16 at 20:47
  • I asked 2 question here because I thought they might be related, one is putting `x` in missing and one is copying from row in missing. Also the next problem is more of "how to improve" like rather than "what is solution". So can I ask it here in SO or I have to ask it code review? – Eular Jun 14 '16 at 20:53
  • Sorry I still don't get it: Why not just use NaN for missing, see e.g. http://stackoverflow.com/a/17534682/1240268. Using x is just going to give you a bad time. If they are distinct questions they should be asked separately - this helps others (searching for a solution similar to your problem), and stops the question being closed as "too broad". – Andy Hayden Jun 14 '16 at 21:30
  • In a simplified way let's say x=0 here and I want to fill the newly made missing value with 0 and keep the previous missing values – Eular Jun 14 '16 at 21:37

1 Answers1

1

I don't think there is a way to pad just the "middle" values, but here's a way to do it (using ffill, bfill and fillna):

In [11]: df1  # assuming Date is the index via df.set_index("Date")
Out[11]:
      cost    NC
Date
20       5   NaN
21       7   NaN
23       9  78.0
25       6  80.0

In [12]: df2 = df1.reindex(np.arange(20,27))
         # 26 is sufficient, but let's see it working!

In [13]: df2
Out[13]:
      cost    NC
Date
20     5.0   NaN
21     7.0   NaN
22     NaN   NaN
23     9.0  78.0
24     NaN   NaN
25     6.0  80.0
26     NaN   NaN

You don't want to fill in the "outside" NaNs, which can be got with:

In [14]: df2.bfill().notnull() & df2.ffill().notnull()
Out[14]:
       cost     NC
Date
20     True  False
21     True  False
22     True  False
23     True   True
24     True   True
25     True   True
26    False  False

Now, we can update these (if they would be updated with a fillna):

In [15]: df2[df2.bfill().notnull() & df2.ffill().notnull()] = df2.fillna(0) # x = 0

In [16]: df2
Out[15]:
      cost    NC
Date
20     5.0   NaN
21     7.0   NaN
22     0.0   NaN
23     9.0  78.0
24     0.0   0.0
25     6.0  80.0
26     NaN   NaN

To (partially) answer the second question, IMO you're always better off in that situation to start with a pivot (this will give you a much better starting point):

In [21]: df
Out[21]:
   Date type  cost
0    10    a    30
1    11    a    30
2    11    b    25
3    13    a    27

In [22]: df.pivot_table("cost", "Date", "type")
Out[22]:
type     a     b
Date
10    30.0   NaN
11    30.0  25.0
13    27.0   NaN

Perhaps you are looking to fill forward from there? (and unstack if necessary).

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Can you also post how to get back to the final dataframe from this pivot table. I am trying `unstack` but I cant get it done – Eular Jun 15 '16 at 05:28
  • @Eular One way to do the latter is `df.pivot_table("cost", "Date", "type").ffill().stack().reset_index(name="cost")` – Andy Hayden Jun 15 '16 at 06:00
  • thanks for the 1st answer, I actually needed this `df[df.ffill().notnull()] = df.fillna(0)`. Here also i am confused why this works but this `df=df[df.ffill().notnull()].fillna(0)` doesnot again for a single column `df["NC"]=df[df["NC"].ffill().notnull()]["NC"].fillna(0)` this works – Eular Jun 15 '16 at 06:35
  • for the 2nd solution I did this `df.pivot_table(index="Date",columns='type').reindex([10,11,12,13],method="pad").stack().reset_index()` which works perfectly. – Eular Jun 15 '16 at 06:47