1

I have some time series data in a pandas dataframe. Many entries have a NaN value, and I'd like to make a matching column storing whether that entry has a NaN value or not, and then replace the NaN values with 0. While replacing the NaN values is easy enough, I can't figure out how to do the first part. If it matters, I already used a pivot operation so column names are tuples. Some of the values in the dataframe are ints and some are floats.

Example input:

df = pd.DataFrame(np.array([[5, 7,  np.nan], [np.nan, 8, 9.8], [7, np.nan, 12]]), columns=[('Label', 'A'), ('Label', 'B'), ('Label', 'C')])
    
                 Label
             A   B   C
2021-03-01   5   7 NaN
2021-03-02 NaN   8 9.8
2021-03-03   7 NaN  12

Desired output:

                                                        Label
                          A                B                C
            Has data  Value  Has data  Value  Has data  Value
2021-03-01         1      5         1      7         0      0
2021-03-02         0      0         1      8         1     98
2021-03-03         1      7         0      0         1     12
quyksilver
  • 13
  • 6

4 Answers4

2

concat

pd.concat({
    'Has data': df.notna().astype(int),
    'Value': df.fillna(0)
}).unstack(0)

              Label                                    
                  A              B              C      
           Has data Value Has data Value Has data Value
2021-03-01        1   5.0        1   7.0        0   0.0
2021-03-02        0   0.0        1   8.0        1  98.0
2021-03-03        1   7.0        0   0.0        1  12.0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0
df = pd.DataFrame(np.array([[5, 7,  np.nan], [np.nan, 8, 98], [7, np.nan, 12]]), columns=['A', 'B', 'C'])

enter image description here

df['A has data'] = ~(df['A'].isna()).astype(bool)*1

df['B has data'] = ~(df['B'].isna()).astype(bool)*1

df['C has data'] = ~(df['C'].isna()).astype(bool)*1

enter image description here

df.fillna(0)

enter image description here

Dharman
  • 30,962
  • 25
  • 85
  • 135
aj7amigo
  • 368
  • 2
  • 12
0

If you have df:

df = pd.DataFrame(
    {
        ("Label", "A"): [5, np.nan, 7],
        ("Label", "B"): [5, 8, np.nan],
        ("Label", "C"): [np.nan, 98, 12],
    },
    index=["2021-03-01", "2021-03-02", "2021-03-03"],
)
print(df)

           Label           
               A    B     C
2021-03-01   5.0  5.0   NaN
2021-03-02   NaN  8.0  98.0
2021-03-03   7.0  NaN  12.0

Then:

# reindex dataframe:
mux = pd.MultiIndex.from_product(
    [df.columns.levels[0], df.columns.levels[1], ["Has data", "Value"]]
)
df = df.reindex(columns=mux)

df.loc[:, df.columns.get_level_values(2) == "Has data"] = (
    ~(df.loc[:, df.columns.get_level_values(2) == "Has data"].isna()) * 1
)
df = df.fillna(0)
print(df)

Prints:

              Label                                    
                  A              B              C      
           Has data Value Has data Value Has data Value
2021-03-01        1   5.0        1   5.0        0   0.0
2021-03-02        0   0.0        1   8.0        1  98.0
2021-03-03        1   7.0        0   0.0        1  12.0
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
0

Let me borrow the construction of the sample df from @aj7amigo (many thanks! -- but note to @quyksilver to please read this)

df = pd.DataFrame(np.array([[5, 7,  np.nan], [np.nan, 8, 98], [7, np.nan, 12]]), columns=['A', 'B', 'C'])

Down below is the proposed solution. To understand it a bit better we note that this part of it

df.mask(~df.isna(),1).fillna(0).astype(int)

produces


    A   B   C
0   1   1   0
1   0   1   1
2   1   0   1

which has the 'has value' data as you want it. So now we just need to combine it with the original dataframe where we filled nans with 0s. To get it in to the shape you specified there are a few manipulations -- suggest you try these step by step to understand what they do

Here is the solution, note we do not need to loop explicitly over columns etc:

(pd.DataFrame({
    'Value':df.fillna(0).stack(), 
    'Has data':df.mask(~df.isna(),1)
                 .fillna(0)
                 .astype(int)
                 .stack()})
    .stack()
    .unstack(level=1)
    .unstack()
)

this produces

    A                   B                   C
    Value   Has data    Value   Has data    Value   Has data
0   5.0     1.0         7.0     1.0         0.0     0.0
1   0.0     0.0         8.0     1.0         98.0    1.0
2   7.0     1.0         0.0     0.0         12.0    1.0
piterbarg
  • 8,089
  • 2
  • 6
  • 22