42

If I've got a DataFrame in pandas which looks something like:

    A   B   C
0   1 NaN   2
1 NaN   3 NaN
2 NaN   4   5
3 NaN NaN NaN

How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).

Dave Challis

9 Answers

72

Fill the NaNs from the left with fillna, then take the leftmost column:

df.fillna(method='bfill', axis=1).iloc[:, 0]
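
Note that fillna(method=...) is deprecated in recent pandas (2.1+); assuming a current version, the equivalent sketch is:

df.bfill(axis=1).iloc[:, 0]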
Andy Jones
  • 5
    Awesome solution to my problem. Thanks. How would I get the column name of the first non-null value? – RajeshM Aug 03 '16 at 07:33
  • 1
    @RajeshM To get the column name of the first non-null value, consider [this answer](https://stackoverflow.com/a/63238261/). – Asclepius Aug 03 '20 at 23:30
14

This is a really messy way to do this: first use first_valid_index to get the first valid column label for each row, convert the returned series to a dataframe so we can call apply row-wise, then use that to index back into the original df:

In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0     1
1     3
2     4
3   NaN
dtype: float64

EDIT

A slightly cleaner way:

In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)

Out[12]:
0     1
1     3
2     4
3   NaN
dtype: float64
EdChum
  • The use of `df` as a global here makes me a little bit queasy. Really, you want a combination of [@yangjie's answer](http://stackoverflow.com/a/31828559/2071807) and this one. I'm going to post a combination, but feel free to just steal the best bits (if you think there are any!) – LondonRob Aug 06 '15 at 14:04
  • @LondonRob I think that really it should just iterate over the df once – EdChum Aug 06 '15 at 14:22
  • *Now* you're talking. This is now really good. +1 (why not just obliterate your first iteration?) – LondonRob Aug 06 '15 at 14:56
  • 1
    @LondonRob I like leaving my thought process, it shows an iterative approach to improving my answer, sometimes just showing the final answer only doesn't demonstrate how to moderate your approach and the responses to feedback – EdChum Aug 06 '15 at 14:58
14

I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:

import numpy as np

def get_first_non_null(df):
    a = df.values
    # argmin over np.isnan finds the first False, i.e. the first non-NaN column per row
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]

EDIT: Here's a fully vectorized solution, which can be a good deal faster again depending on the shape of the input. Updated benchmarking below.

def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    # translate (row, col) pairs into positions in the flattened array
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
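
As a quick check (assuming numpy and pandas are imported and df is the question's example), both functions return the same values:

get_first_non_null_vec(df)  # -> array([ 1.,  3.,  4., nan])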

If a row is completely null then the corresponding value will be null also. Here's some benchmarking against unutbu's solution:

df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop


df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop


df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
JoeCondron
13

Here is another way to do it:

In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]: 
0     1
1     3
2     4
3   NaN
dtype: float64

The idea here is to use stack to move the columns into a row index level:

In [184]: df.stack()
Out[184]: 
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64

Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:

In [185]: df.stack().groupby(level=0).first()
Out[185]: 
0    1
1    3
2    4
dtype: float64

All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:

df.stack().groupby(level=0).first().reindex(df.index)
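
One caveat, hedged: this relies on stack dropping NaN values, which is the long-standing default. If you are on a pandas version where stack keeps NaNs (the newer future_stack behavior), an explicit dropna makes the intent unambiguous:

df.stack().dropna().groupby(level=0).first().reindex(df.index)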
unutbu
  • This is a great walkthrough of how to dynamically fill a single column with non-null string or numeric values from n columns- just as the OP asked. – Lindsay Veazey Aug 12 '20 at 16:44
5

This is nothing new, but it's a combination of the best bits of @yangjie's approach with a list comprehension, and @EdChum's df.apply approach, which I think is easiest to understand.

First, which columns do we want to pick our values from?

In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)

In [96]: pick_cols
Out[96]: 
0       A
1       B
2       B
3    None
dtype: object

Now how do we pick the values?

In [100]: [df.loc[k, v] if v is not None else None
    ....:     for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]

This is ok, but we really want the index to match that of the original DataFrame:

In [98]: pd.Series({k: df.loc[k, v] if v is not None else None
   ....:     for k, v in pick_cols.items()})
Out[98]: 
0     1
1     3
2     4
3   NaN
dtype: float64
LondonRob
  • @AndyHayden I'm sort of surprised there's no way to go from `pick_cols` to the final outcome. What would this operation be called do you think? – LondonRob Aug 06 '15 at 14:52
  • I updated my answer, this iterates just once, you have to test if the `first_valid_index` is `None` to handle the row with all `NaN`s – EdChum Aug 06 '15 at 14:56
4

groupby on axis=1

If we pass a callable that returns the same value for every column, we group all of the columns together. This allows us to use the groupby object's first method, which makes this easy.

df.groupby(lambda x: 'Z', axis=1).first()

     Z
0  1.0
1  3.0
2  4.0
3  NaN

This returns a dataframe whose single column is named with the value returned by the callable ('Z').
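
(Hedged aside: groupby(axis=1) is deprecated in recent pandas; one sketch of the same idea on current versions, my adaptation rather than part of the original answer, is to transpose first:)

df.T.groupby(lambda x: 'Z').first().T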


lookup, notna, and idxmax

df.lookup(df.index, df.notna().idxmax(1))

array([ 1.,  3.,  4., nan])
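
DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a minimal replacement sketch using positional indexing (assumes import numpy as np):

idx = df.notna().idxmax(1)
df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(idx)]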

argmin and slicing

v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]

array([ 1.,  3.,  4., nan])
piRSquared
2

Here is a one-line solution:

[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]

Edit:

This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as an index to get the first non-null item in each row.

If there is no non-null value in the row, row.first_valid_index() is None, which cannot be used as an index, so an explicit is not None check is needed.

I packed everything into a list comprehension for brevity.
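
As suggested in the comments, a slightly longer sketch (my rewrite, not part of the original one-liner) avoids calling first_valid_index() twice per row:

def first_non_null(row):
    idx = row.first_valid_index()  # None when the row is all-NaN
    return row[idx] if idx is not None else None

[first_non_null(row) for _, row in df.iterrows()]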

yangjie
  • 1
    This is great. A bit of explanation about what's going on would make this a really useful answer. Also, it's a shame you have to check `first_valid_index()` twice. Maybe if you lose a little bit of brevity, you'll gain readability and efficiency. – LondonRob Aug 06 '15 at 14:13
  • Yeah, I am not happy with getting `first_valid_index()` twice as well. Thanks for the advice, maybe I will update later. – yangjie Aug 06 '15 at 14:21
2

JoeCondron's answer (EDIT: before his last edit!) is cool, but there is room for significant improvement by avoiding the non-vectorized enumeration:

import numpy as np

def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    # fancy indexing picks one column per row in a single vectorized step
    return a[np.arange(a.shape[0]), col_index]

The improvement is small if the DataFrame is relatively flat:

In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))

In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop

In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop

... but can be relevant on slim DataFrames:

In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))

In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop

In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop

Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).

Pietro Battiston
1

import numpy
import pandas

df = pandas.DataFrame({'A': [1, numpy.nan, numpy.nan, numpy.nan], 'B': [numpy.nan, 3, 4, numpy.nan], 'C': [2, numpy.nan, 5, numpy.nan]})

df
     A    B    C
0  1.0  NaN  2.0
1  NaN  3.0  NaN
2  NaN  4.0  5.0
3  NaN  NaN  NaN

df.apply(lambda x: numpy.nan if x.isnull().all() else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
bhamu