How to make a matching column in a pandas dataframe recording whether its pair has data in that entry location

Question

I have some time series data in a pandas dataframe. Many entries have a NaN value, and I'd like to make a matching column storing whether that entry has a NaN value or not, and then replace the NaN values with 0. While replacing the NaN values is easy enough, I can't figure out how to do the first part. If it matters, I already used a pivot operation so column names are tuples. Some of the values in the dataframe are ints and some are floats.

Example input:

df = pd.DataFrame(np.array([[5, 7,  np.nan], [np.nan, 8, 9.8], [7, np.nan, 12]]), columns=[('Label', 'A'), ('Label', 'B'), ('Label', 'C')])
    
                 Label
             A   B   C
2021-03-01   5   7 NaN
2021-03-02 NaN   8 9.8
2021-03-03   7 NaN  12

Desired output:

                                                        Label
                          A                B                C
            Has data  Value  Has data  Value  Has data  Value
2021-03-01         1      5         1      7         0      0
2021-03-02         0      0         1      8         1     98
2021-03-03         1      7         0      0         1     12

piRSquared · Accepted Answer · 2021-03-23T00:38:06.980

2

`concat`

pd.concat({
    'Has data': df.notna().astype(int),
    'Value': df.fillna(0)
}).unstack(0)

              Label                                    
                  A              B              C      
           Has data Value Has data Value Has data Value
2021-03-01        1   5.0        1   7.0        0   0.0
2021-03-02        0   0.0        1   8.0        1  98.0
2021-03-03        1   7.0        0   0.0        1  12.0

edited Mar 23 '21 at 00:38

answered Mar 23 '21 at 00:27

piRSquared

285,575
57
475
624

1

Works perfectly. Thank you! – quyksilver Mar 23 '21 at 00:55

score 0 · Answer 2 · edited Mar 22 '21 at 23:55

0

df = pd.DataFrame(np.array([[5, 7,  np.nan], [np.nan, 8, 98], [7, np.nan, 12]]), columns=['A', 'B', 'C'])

df['A has data'] = ~(df['A'].isna()).astype(bool)*1

df['B has data'] = ~(df['B'].isna()).astype(bool)*1

df['C has data'] = ~(df['C'].isna()).astype(bool)*1

df.fillna(0)

edited Mar 22 '21 at 23:55

Dharman

30,962
25
85
135

answered Mar 22 '21 at 23:49

aj7amigo

368
2
12

Andrej Kesely · Answer 3 · 2021-03-23T00:35:48.703

If you have df:

df = pd.DataFrame(
    {
        ("Label", "A"): [5, np.nan, 7],
        ("Label", "B"): [5, 8, np.nan],
        ("Label", "C"): [np.nan, 98, 12],
    },
    index=["2021-03-01", "2021-03-02", "2021-03-03"],
)
print(df)

           Label           
               A    B     C
2021-03-01   5.0  5.0   NaN
2021-03-02   NaN  8.0  98.0
2021-03-03   7.0  NaN  12.0

Then:

# reindex dataframe:
mux = pd.MultiIndex.from_product(
    [df.columns.levels[0], df.columns.levels[1], ["Has data", "Value"]]
)
df = df.reindex(columns=mux)

df.loc[:, df.columns.get_level_values(2) == "Has data"] = (
    ~(df.loc[:, df.columns.get_level_values(2) == "Has data"].isna()) * 1
)
df = df.fillna(0)
print(df)

Prints:

              Label                                    
                  A              B              C      
           Has data Value Has data Value Has data Value
2021-03-01        1   5.0        1   5.0        0   0.0
2021-03-02        0   0.0        1   8.0        1  98.0
2021-03-03        1   7.0        0   0.0        1  12.0

score 0 · Answer 4 · answered Mar 23 '21 at 00:29

Let me borrow the construction of the sample df from @aj7amigo (many thanks! -- but note to @quyksilver to please read this)

df = pd.DataFrame(np.array([[5, 7,  np.nan], [np.nan, 8, 98], [7, np.nan, 12]]), columns=['A', 'B', 'C'])

Down below is the proposed solution. To understand it a bit better we note that this part of it

df.mask(~df.isna(),1).fillna(0).astype(int)

produces


    A   B   C
0   1   1   0
1   0   1   1
2   1   0   1

which has the 'has value' data as you want it. So now we just need to combine it with the original dataframe where we filled nans with 0s. To get it in to the shape you specified there are a few manipulations -- suggest you try these step by step to understand what they do

Here is the solution, note we do not need to loop explicitly over columns etc:

(pd.DataFrame({
    'Value':df.fillna(0).stack(), 
    'Has data':df.mask(~df.isna(),1)
                 .fillna(0)
                 .astype(int)
                 .stack()})
    .stack()
    .unstack(level=1)
    .unstack()
)

this produces

    A                   B                   C
    Value   Has data    Value   Has data    Value   Has data
0   5.0     1.0         7.0     1.0         0.0     0.0
1   0.0     0.0         8.0     1.0         98.0    1.0
2   7.0     1.0         0.0     0.0         12.0    1.0

I'll keep your link in mind for the future, thanks! – quyksilver Mar 23 '21 at 00:51 — quyksilver, Mar 23 '21 at 00:51

How to make a matching column in a pandas dataframe recording whether its pair has data in that entry location

4 Answers4

concat

`concat`