0

I'm working with Python 3.6.5.

Here is a little script to generate a multi index dataframe with some "NaN" value.

import pandas as pd
import numpy as np

att_1 = ['X', 'Y']
att_2 = ['a', 'b']

df_1 = pd.DataFrame(np.random.randint(10,19,size=(5, 2)), columns=att_2, 
index=[10,20,30,35,40])
df_2 = pd.DataFrame(np.random.randint(20,29,size=(5, 2)), columns=att_2, 
index=[20,25,40,50,80])

# Concat df with new key dimension for column attribute
df = pd.concat([df_1, df_2], keys=att_1, axis=1)

I get this dataframe

print(df)
       X           Y      
       a     b     a     b
10  17.0  17.0   NaN   NaN
20  15.0  11.0  20.0  28.0
25   NaN   NaN  23.0  24.0
30  12.0  16.0   NaN   NaN
35  10.0  10.0   NaN   NaN
40  15.0  14.0  25.0  28.0
50   NaN   NaN  22.0  22.0
80   NaN   NaN  23.0  21.0

And I would like to replace the "NaN" value with the last valid value, BUT ONLY FOR ONE COLUMN. For example, I would like to get this (for column named 'X','b')

print(df)
       X           Y      
       a     b     a     b
10  17.0  17.0   NaN   NaN
20  15.0  11.0  20.0  28.0
25   NaN  11.0  23.0  24.0
30  12.0  16.0   NaN   NaN
35  10.0  10.0   NaN   NaN
40  15.0  14.0  25.0  28.0
50   NaN  14.0  22.0  22.0
80   NaN  14.0  23.0  21.0

I tried this :

# Replace NaN value by last valid value for column named 'X','b'
df['X']['b'].fillna(method='ffill', inplace=True)

But I get this error "A value is trying to be set on a copy of a slice from a DataFrame"

I can not find any solution for a dataframe with multi-index of column. I found this link that gives me no hope. (https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.MultiIndex.fillna.html)

Does anyone have an idea to help me?

David
  • 450
  • 3
  • 14

1 Answers1

3

After some digging, I found that there's a more appropriate way of referencing columns that we want to edit specifically. Check How to deal with SettingWithCopyWarning in Pandas? out for more info. Another resource: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy.

It is recommended that we use .loc to alter columns. Using the line below removed any errors.

df.loc[df['X']['b'].isnull(), ('X', 'b')] = df['X']['b'].ffill()

However, here I am using the max of the column to replace every NaN with. I'm uncertain on what is meant by the last valid value.

hqkhan
  • 463
  • 2
  • 10
  • I did make a mistake. I corrected my mistake, but I get the following error: « A value is trying to be set on a copy of slice from a DataFrame » – David Dec 11 '18 at 23:22
  • You're right. I've updated my answer. Take a look if it helps. – hqkhan Dec 12 '18 at 04:03
  • your idea is interesting. I updated my question to be clearer about "the last valid value" (= the value directly above ... see my updated question). I tried to get inspired to find an equivalent of the function "df.fillna" applied to a numpy matrix, but I did not succeed. – David Dec 12 '18 at 07:24
  • I finally found a solution. **df.loc[df['X']['b'].isnull(), ('X', 'b')] = df['X']['b'].ffill()** @hqkhan Thanks a lot for your advice to use ".loc" – David Dec 12 '18 at 11:05
  • Ah, I was unable to find `.ffill()`. That's useful for me to know as well. No problem and thank you as well! Glad to help. – hqkhan Dec 12 '18 at 15:17