42

If I've got a DataFrame in pandas which looks something like:

    A   B   C
0   1 NaN   2
1 NaN   3 NaN
2 NaN   4   5
3 NaN NaN NaN

How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or equivalent Series).

Dave Challis

9 Answers

72

Fill the NaNs from the left with fillna, then take the leftmost column:

df.fillna(method='bfill', axis=1).iloc[:, 0]
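
Note that fillna(method=...) is deprecated in recent pandas (2.1+); assuming a current version, the equivalent sketch is:

df.bfill(axis=1).iloc[:, 0]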
Andy Jones
  • 5
    Awesome solution to my problem. Thanks. How would I get the column name of the first non-null value? – RajeshM Aug 03 '16 at 07:33
  • 1
    @RajeshM To get the column name of the first non-null value, consider [this answer](https://stackoverflow.com/a/63238261/). – Asclepius Aug 03 '20 at 23:30
14

This is a really messy way to do this: first use first_valid_index to get the first valid column label for each row, convert the returned series to a dataframe so we can call apply row-wise, then use that to index back into the original df:

In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0     1
1     3
2     4
3   NaN
dtype: float64

EDIT

A slightly cleaner way:

In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)

Out[12]:
0     1
1     3
2     4
3   NaN
dtype: float64
EdChum
  • The use of `df` as a global here makes me a little bit queasy. Really, you want a combination of [@yangjie's answer](http://stackoverflow.com/a/31828559/2071807) and this one. I'm going to post a combination, but feel free to just steal the best bits (if you think there are any!) – LondonRob Aug 06 '15 at 14:04
  • @LondonRob I think that really it should just iterate over the df once – EdChum Aug 06 '15 at 14:22
  • *Now* you're talking. This is now really good. +1 (why not just obliterate your first iteration?) – LondonRob Aug 06 '15 at 14:56
  • 1
    @LondonRob I like leaving my thought process, it shows an iterative approach to improving my answer, sometimes just showing the final answer only doesn't demonstrate how to moderate your approach and the responses to feedback – EdChum Aug 06 '15 at 14:58
14

I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:

import numpy as np

def get_first_non_null(df):
    a = df.values
    # argmin over np.isnan finds the first False, i.e. the first non-NaN column per row
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]

EDIT: Here's a fully vectorized solution, which can be a good deal faster again depending on the shape of the input. Updated benchmarking below.

def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    # translate (row, col) pairs into positions in the flattened array
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
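
As a quick check (assuming numpy and pandas are imported and df is the question's example), both functions return the same values:

get_first_non_null_vec(df)  # -> array([ 1.,  3.,  4., nan])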

If a row is completely null then the corresponding value will be null also. Here's some benchmarking against unutbu's solution:

df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop


df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop


df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
JoeCondron
13

Here is another way to do it:

In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]: 
0     1
1     3
2     4
3   NaN
dtype: float64

The idea here is to use stack to move the columns into a row index level:

In [184]: df.stack()
Out[184]: 
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64

Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:

In [185]: df.stack().groupby(level=0).first()
Out[185]: 
0    1
1    3
2    4
dtype: float64

All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:

df.stack().groupby(level=0).first().reindex(df.index)
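
One caveat, hedged: this relies on stack dropping NaN values, which is the long-standing default. If you are on a pandas version where stack keeps NaNs (the newer future_stack behavior), an explicit dropna makes the intent unambiguous:

df.stack().dropna().groupby(level=0).first().reindex(df.index)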
unutbu
  • This is a great walkthrough of how to dynamically fill a single column with non-null string or numeric values from n columns- just as the OP asked. – Lindsay Veazey Aug 12 '20 at 16:44
5

This is nothing new, but it's a combination of the best bits of @yangjie's approach with a list comprehension, and @EdChum's df.apply approach, which I think is easiest to understand.

First, which columns do we want to pick our values from?

In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)

In [96]: pick_cols
Out[96]: 
0       A
1       B
2       B
3    None
dtype: object

Now how do we pick the values?

In [100]: [df.loc[k, v] if v is not None else None
    ....:     for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]

This is ok, but we really want the index to match that of the original DataFrame:

In [98]: pd.Series({k: df.loc[k, v] if v is not None else None
   ....:     for k, v in pick_cols.items()})
Out[98]: 
0     1
1     3
2     4
3   NaN
dtype: float64
LondonRob
  • @AndyHayden I'm sort of surprised there's no way to go from `pick_cols` to the final outcome. What would this operation be called do you think? – LondonRob Aug 06 '15 at 14:52
  • I updated my answer, this iterates just once, you have to test if the `first_valid_index` is `None` to handle the row with all `NaN`s – EdChum Aug 06 '15 at 14:56
4

groupby on axis=1

If we pass a callable that returns the same value for every column, we group all of the columns together. This allows us to use the groupby object's first method, which makes this easy.

df.groupby(lambda x: 'Z', axis=1).first()

     Z
0  1.0
1  3.0
2  4.0
3  NaN

This returns a dataframe whose single column is named with the value returned by the callable ('Z').
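
(Hedged aside: groupby(axis=1) is deprecated in recent pandas; one sketch of the same idea on current versions, my adaptation rather than part of the original answer, is to transpose first:)

df.T.groupby(lambda x: 'Z').first().T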


lookup, notna, and idxmax

df.lookup(df.index, df.notna().idxmax(1))

array([ 1.,  3.,  4., nan])
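
DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a minimal replacement sketch using positional indexing (assumes import numpy as np):

idx = df.notna().idxmax(1)
df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(idx)]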

argmin and slicing

v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]

array([ 1.,  3.,  4., nan])
piRSquared
2

Here is a one-line solution:

[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]

Edit:

This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as an index to get the first non-null item in each row.

If there is no non-null value in the row, row.first_valid_index() is None, which cannot be used as an index, so an explicit is not None check is needed.

I packed everything into a list comprehension for brevity.
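
As suggested in the comments, a slightly longer sketch (my rewrite, not part of the original one-liner) avoids calling first_valid_index() twice per row:

def first_non_null(row):
    idx = row.first_valid_index()  # None when the row is all-NaN
    return row[idx] if idx is not None else None

[first_non_null(row) for _, row in df.iterrows()]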

yangjie
  • 1
    This is great. A bit of explanation about what's going on would make this a really useful answer. Also, it's a shame you have to check `first_valid_index()` twice. Maybe if you lose a little bit of brevity, you'll gain readability and efficiency. – LondonRob Aug 06 '15 at 14:13
  • Yeah, I am not happy with getting `first_valid_index()` twice as well. Thanks for the advice, maybe I will update later. – yangjie Aug 06 '15 at 14:21
2

JoeCondron's answer (EDIT: before his last edit!) is cool, but there is room for significant improvement by avoiding the non-vectorized enumeration:

import numpy as np

def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    # fancy indexing picks one column per row in a single vectorized step
    return a[np.arange(a.shape[0]), col_index]

The improvement is small if the DataFrame is relatively flat:

In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))

In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop

In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop

... but can be relevant on slim DataFrames:

In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))

In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop

In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop

Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).

Pietro Battiston
1

import numpy
import pandas

df = pandas.DataFrame({'A': [1, numpy.nan, numpy.nan, numpy.nan], 'B': [numpy.nan, 3, 4, numpy.nan], 'C': [2, numpy.nan, 5, numpy.nan]})

df
     A    B    C
0  1.0  NaN  2.0
1  NaN  3.0  NaN
2  NaN  4.0  5.0
3  NaN  NaN  NaN

df.apply(lambda x: numpy.nan if x.isnull().all() else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
bhamu