9

I have a sample dataframe shown below. For each row, I want to check c1 first; if it is null, check c2, and so on. In this way, find the first non-null column and store its value in the column result.

ID  c1  c2  c3  c4  result
1   a   b           a
2       cc  dd      cc
3           ee  ff  ee
4               gg  gg

I am using the approach below for now, but I would like to know if there is a better method. (The column names do not follow any pattern; this is just a sample.)

import numpy as np

df["result"] = np.where(df["c1"].notnull(), df["c1"], None)
df["result"] = np.where(df["result"].notnull(), df["result"], df["c2"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c3"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c4"])
df["result"] = np.where(df["result"].notnull(), df["result"], "unknown")

When there are lots of columns, this approach becomes unwieldy.
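For reference, the chained np.where calls can be generalized with a loop over the column names (a sketch of the same logic, assuming NaN marks the empty cells; it still makes one pass per column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'c1': ['a', np.nan, np.nan, np.nan],
                   'c2': ['b', 'cc', np.nan, np.nan],
                   'c3': [np.nan, 'dd', 'ee', np.nan],
                   'c4': [np.nan, np.nan, 'ff', 'gg']})

# Same logic as the chained np.where calls, generalized to any column list
result = pd.Series(np.nan, index=df.index, dtype=object)
for col in ['c1', 'c2', 'c3', 'c4']:
    # keep values already found, fill the remaining nulls from the next column
    result = result.where(result.notnull(), df[col])
df['result'] = result.fillna('unknown')
```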

jpp
qqqwww
  • Possible duplicate of [Computing the first non-missing value from each column in a DataFrame](https://stackoverflow.com/questions/23309514/computing-the-first-non-missing-value-from-each-column-in-a-dataframe) – hhoke1 Apr 24 '18 at 14:48

4 Answers

10

Back-fill the NaNs along the rows first, then select the first column by iloc:

df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')

Or:

df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')

print (df)
   ID   c1   c2   c3   c4 result
0   1    a    b  NaN  NaN      a
1   2  NaN   cc   dd  NaN     cc
2   3  NaN  NaN   ee   ff     ee
3   4  NaN  NaN  NaN   gg     gg

Performance:

df = pd.concat([df] * 1000, ignore_index=True)


In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop

In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop

# jpp's solution
In [222]: %%timeit
     ...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
     ...: 
     ...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
     ...: 
1 loop, best of 3: 180 ms per loop

# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop
jezrael
2

Setup

df = df.set_index('ID') # if necessary
df
     c1   c2   c3   c4
ID                    
1     a    b  NaN  NaN
2   NaN   cc   dd  NaN
3   NaN  NaN   ee   ff
4   NaN  NaN  NaN   gg

Solution
stack + groupby + first
stack implicitly drops NaNs, so groupby.first is guaranteed to give you the first non-null value if it exists. Assigning the result back exposes NaNs at any missing indices, which you can fill with a subsequent fillna call.

df['result'] = df.stack().groupby(level=0).first()
# df['result'] = df['result'].fillna('unknown') # if necessary 
df
     c1   c2   c3   c4 result
ID                           
1     a    b  NaN  NaN      a
2   NaN   cc   dd  NaN     cc
3   NaN  NaN   ee   ff     ee
4   NaN  NaN  NaN   gg     gg

(Beware: this is slow for larger dataframes; for performance, use @jezrael's solution.)

cs95
2

I am using lookup with the data from jpp (where empty cells are empty strings rather than NaN):

df=df.set_index('ID')
s=df.ne('').idxmax(1)
df['Result']=df.lookup(s.index,s)
df
Out[492]: 
   c1  c2  c3  c4 Result
ID                      
1   a   b              a
2      cc  dd         cc
3          ee  ff     ee
4              gg     gg
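Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on modern pandas the same row-wise pick needs a replacement. A sketch using NumPy fancy indexing on the same empty-string data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': ['a', '', '', ''],
                   'c2': ['b', 'cc', '', ''],
                   'c3': ['', 'dd', 'ee', ''],
                   'c4': ['', '', 'ff', 'gg']},
                  index=pd.Index([1, 2, 3, 4], name='ID'))

s = df.ne('').idxmax(axis=1)  # first non-empty column label per row
# lookup replacement: index the underlying array by (row position, column position)
df['Result'] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(s)]
```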
BENY
1

One way is to use pd.DataFrame.lookup with pd.Series.first_valid_index applied on a transposed dataframe:

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'c1': ['a', '', '', ''],
                   'c2': ['b', 'cc', '', ''],
                   'c3': ['' , 'dd', 'ee', ''],
                   'c4': ['', '', 'ff', 'gg']})

df = df.replace('', np.nan)

df['result'] = df.lookup(df.index, df.iloc[:, 1:].T.apply(pd.Series.first_valid_index))

print(df)

   ID   c1   c2   c3   c4 result
0   1    a    b  NaN  NaN      a
1   2  NaN   cc   dd  NaN     cc
2   3  NaN  NaN   ee   ff     ee
3   4  NaN  NaN  NaN   gg     gg
jpp