6

I want to know the first year with incoming revenue for various projects.

Given the following dataframe:

ID  Y1      Y2      Y3
0   NaN     8       4
1   NaN     NaN     1
2   NaN     NaN     NaN
3   5       3       NaN

I would like to return the name of the first column with a non-null value by row.

In this case, I would want to return:

['Y2','Y3',NaN,'Y1']

My goal is to add this as a column to the original dataframe.

The following code mostly works, but is really clunky.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
df['first'] = np.nan

for ID in df.index:
    row = df.loc[ID,]
    for i in range(0,len(row)):
        if (~pd.isnull(row[i])):
            df.loc[ID,'first'] = row.index[i]
            break

returns:

   Y1  Y2  Y3  first
0 NaN  8   4   Y2   
1 NaN NaN  1   Y3   
2 NaN NaN NaN  first
3  5   3  NaN  Y1   

Does anyone know a more elegant solution?

Nathan Clement

3 Answers

14

You can apply first_valid_index to each row in the dataframe using a lambda expression with axis=1 to specify rows.

>>> df.apply(lambda row: row.first_valid_index(), axis=1)
ID
0      Y2
1      Y3
2    None
3      Y1
dtype: object

To apply it to your dataframe:

df = df.assign(first = df.apply(lambda row: row.first_valid_index(), axis=1))

>>> df
    Y1  Y2  Y3 first
ID                  
0  NaN   8   4    Y2
1  NaN NaN   1    Y3
2  NaN NaN NaN  None
3    5   3 NaN    Y1
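For convenience, here is the same idea as a self-contained sketch built on the question's sample data. The `year_cols` list is an assumption added here, not part of the original snippet; it restricts the scan to the year columns so that a previously added `first` column is never searched itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Y1': [np.nan, np.nan, np.nan, 5],
    'Y2': [8, np.nan, np.nan, 3],
    'Y3': [4, 1, np.nan, np.nan],
})

# Assumption: only these columns hold yearly revenue and should be scanned.
year_cols = ['Y1', 'Y2', 'Y3']

# first_valid_index() returns the label of the first non-null entry in each row,
# or None when the whole row is null.
first = df[year_cols].apply(lambda row: row.first_valid_index(), axis=1)
df['first'] = first

print(df)
```

Rows that are entirely null end up with a missing value in `first` rather than a column name.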
Alexander
1

Avoiding apply is preferable because it is not vectorized. The following is vectorized. It was tested with Pandas 1.1.

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})

# df.dropna(how='all', inplace=True)  # Optional but cleaner

# For ranking only:
col_ranks = pd.DataFrame(index=df.columns, data=np.arange(1, 1 + len(df.columns)), columns=['first_notna_rank'], dtype='UInt8') # UInt8 supports max value of 255.

To find the name of the first non-null column

df['first_notna_name'] = df.dropna(how='all').notna().idxmax(axis=1).astype('string')

If df is guaranteed to have no rows with all nulls, the .dropna operation above can optionally be removed.
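As a possible alternative to the .dropna step, the all-null rows can be masked out after the fact. This is only a sketch of a swapped-in variant, not the line above: it relies on `idxmax` returning the first column for an all-False row and then blanks those rows with `where`. The variable names `mask` and `first_name` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Y1': [np.nan, np.nan, np.nan, 5],
    'Y2': [8, np.nan, np.nan, 3],
    'Y3': [4, 1, np.nan, np.nan],
})

# True where a value is present.
mask = df.notna()

# idxmax(axis=1) gives the label of the first True in each row. For a row with
# no True at all it would report the first column, so those rows are replaced
# with a missing value via where().
first_name = mask.idxmax(axis=1).where(mask.any(axis=1))

print(first_name)
```

Both variants stay vectorized; this one just avoids relying on index alignment when assigning the result back.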

To then find the first non-null value

Using bfill:
df['first_notna_value'] = df[df.columns.difference(['first_notna_name'])].bfill(axis=1).iloc[:, 0]
Using melt:
df['first_notna_value'] = df.melt(id_vars='first_notna_name', value_vars=df.columns.difference(['first_notna_name']), ignore_index=False).query('first_notna_name == variable').merge(df[[]], how='right', left_index=True, right_index=True).loc[df.index, 'value']

If df is guaranteed to have no rows with all nulls, the .merge operation above can optionally be removed.
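The bfill variant above can be tried in isolation with the sketch below. It assumes the frame contains only the value columns (the sample Y1..Y3), so no `columns.difference` filtering is needed; `first_value` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Y1': [np.nan, np.nan, np.nan, 5],
    'Y2': [8, np.nan, np.nan, 3],
    'Y3': [4, 1, np.nan, np.nan],
})

# Back-filling along each row pulls the first non-null value into the leftmost
# column; rows that are entirely null stay NaN.
first_value = df.bfill(axis=1).iloc[:, 0]

print(first_value.tolist())  # [8.0, 1.0, nan, 5.0]
```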

To rank the name

df = df.merge(col_ranks, how='left', left_on='first_notna_name', right_index=True)

Is there a better way?

Output

    Y1   Y2   Y3 first_notna_name  first_notna_value  first_notna_rank
0  NaN  8.0  4.0               Y2                8.0                 2
1  NaN  NaN  1.0               Y3                1.0                 3
2  NaN  NaN  NaN             <NA>                NaN              <NA>
3  5.0  3.0  NaN               Y1                5.0                 1

Partial credit: answers by me, piRSquared, and Andy

Asclepius
-1

Apply this code to a dataframe with only one row. As written, it returns the name of the last column in the row that contains a non-null value; index with [0] instead of [-1] to get the first such column.

row.columns[~(row.loc[:].isna()).all()][-1]
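A minimal sketch of the single-row approach, adjusted to pick the first non-null column and to cope with a row that is entirely null. The helper name `first_non_null_column` is hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

def first_non_null_column(row):
    """Return the first column label of a one-row DataFrame whose value is
    non-null, or None if every value is null. (Hypothetical helper.)"""
    # notna().all() yields one boolean per column; convert it to a plain
    # array before using it to filter the column labels.
    non_null_cols = row.columns[row.notna().all().to_numpy()]
    return non_null_cols[0] if len(non_null_cols) else None

row = pd.DataFrame({'Y1': [np.nan], 'Y2': [8], 'Y3': [4]})
print(first_non_null_column(row))  # Y2
```

Without the length check, indexing an empty result would raise an IndexError on an all-null row.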