Computing the first non-missing value from each column in a DataFrame

Question

I have a DataFrame which looks like this:

            1125400  5430095  1095751
2013-05-22   105.24      NaN  6507.58
2013-05-23   104.63      NaN  6393.86
2013-05-26   104.62      NaN  6521.54
2013-05-27   104.62      NaN  6609.31
2013-05-28   104.54    87.79  6640.24
2013-05-29   103.91    86.88  6577.39
2013-05-30   103.43    87.66  6516.55
2013-06-02   103.56    87.55  6559.43

I would like to compute the first non-NaN value in each column.

As "Locate first and last non NaN values in a Pandas DataFrame" points out, first_valid_index can be used. Unfortunately, when called on the DataFrame itself, it returns the first row where at least one element is not NaN and does not work per column.

I'm voting to reopen this question because the marked duplicate deals with a row-wise operation and this question deals with a column-wise operation. The questions and their answers are substantively different. — William Miller, Dec 13 '22 at 04:24

score 14 · Answer 1 · answered Jun 05 '14 at 15:13

You should use the apply function, which applies a function to each column (the default) or to each row:

>>> first_valid_indices = df.apply(lambda series: series.first_valid_index())
>>> first_valid_indices
1125400   2013-05-22 00:00:00
5430095   2013-05-28 00:00:00
1095751   2013-05-22 00:00:00

first_valid_indices will then be a Series containing the first_valid_index for each column.

You could also define the lambda function as a normal function outside:

def first_valid_index(series):
    return series.first_valid_index()

and then call apply like this:

df.apply(first_valid_index)
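
To go from those indices to the values themselves, something along the following lines should work. This is a minimal sketch, not part of the original answer: the sample frame is a shortened copy of the question's data with the column labels written as strings, the name first_values is made up for illustration, and it assumes every column holds at least one non-NaN value and that the index has no duplicate labels.

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's data (column labels as strings here).
df = pd.DataFrame(
    {
        "1125400": [105.24, 104.63, 104.62, 104.62, 104.54],
        "5430095": [np.nan, np.nan, np.nan, np.nan, 87.79],
        "1095751": [6507.58, 6393.86, 6521.54, 6609.31, 6640.24],
    },
    index=pd.to_datetime(
        ["2013-05-22", "2013-05-23", "2013-05-26", "2013-05-27", "2013-05-28"]
    ),
)

# Index label of the first non-NaN entry in each column.
first_valid_indices = df.apply(pd.Series.first_valid_index)

# Look up the value stored at that label, column by column.
# Assumes no column is entirely NaN and the index labels are unique.
first_values = pd.Series(
    {col: df.at[idx, col] for col, idx in first_valid_indices.items()}
)

# Alternative: back-fill each column, then read the first row.
first_values_bfill = df.bfill().iloc[0]
```

The back-fill variant is shorter, but it copies the whole frame, so the explicit lookup may be preferable for large data.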

answered Jun 05 '14 at 15:13

Felix Zumstein

Rather than building a lambda function or a regular function, you could use the unbound method on the Series class: `df.apply(pd.Series.first_valid_index)` – poulter7 Jul 28 '15 at 22:09
The above code only gives, for each column, the index of the first non-null entry. It is incomplete in that it does not show how to get the values themselves in one go. – rko Feb 09 '18 at 22:05

score 2 · Answer 2 · answered Jul 26 '16 at 09:53

The built-in function DataFrame.groupby().column.first() returns the first non-null value in the column, while last() returns the last.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html

If you don't wish to get the first value for each group, you can add a dummy column of 1s, so that the whole frame forms a single group. Then get the first non-null value using the groupby and first functions.

from pandas import DataFrame

df = DataFrame({'a':[None,1,None],'b':[None,2,None]})
df['dummy'] = 1
df.groupby('dummy').first()
df.groupby('dummy').last()
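
A possible variation, sketched here as an assumption and not taken from the answer: groupby also accepts an array of keys, so a constant key array can stand in for the dummy column and leave the frame unmodified. The names key, first_values and last_values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [None, 1, None], 'b': [None, 2, None]})

# One constant key per row puts every row in the same group,
# without adding a helper column to the frame.
key = np.zeros(len(df), dtype=int)

# first()/last() skip missing values within the group; iloc[0] pulls
# the single resulting row out as a Series indexed by column name.
first_values = df.groupby(key).first().iloc[0]
last_values = df.groupby(key).last().iloc[0]
```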

This only works for numeric types. If any column is of object type, then you will get None — nenetto, May 12 '20 at 18:04

Woody Pride · Answer 3 · 2014-04-26T11:23:26.240

By compute I assume you mean access?

The simplest way to do this is with the pd.Series.first_valid_index() method, probably inside a dict comprehension:

values = {col : DF.loc[DF[col].first_valid_index(), col] for col in DF.columns}
values

Just to be clear, each column in a pandas DataFrame is a Series. So the above is the same as doing:

values = {}
for column in DF.columns:
    First_Non_Null_Index = DF[column].first_valid_index()
    values[column] = DF.loc[First_Non_Null_Index, column]

So the operation in my one-line solution is on a per-column basis, i.e. it is not going to create the type of error you seem to be suggesting in the edit you made to the question. Let me know if it does not work as expected.
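
One edge case worth guarding against: first_valid_index() returns None for a column that is entirely NaN, and looking up None with .loc would then fail. A minimal sketch of a guarded version follows; the helper name first_non_null and the sample frame are hypothetical, and it assumes the index labels are unique.

```python
import numpy as np
import pandas as pd


def first_non_null(frame):
    """Return {column: first non-null value}, or NaN for all-null columns."""
    values = {}
    for column in frame.columns:
        idx = frame[column].first_valid_index()
        # first_valid_index() gives None when the column has no valid entry.
        values[column] = np.nan if idx is None else frame.at[idx, column]
    return values


# Hypothetical sample: column "b" contains no valid values at all.
DF = pd.DataFrame({"a": [np.nan, 1.0, 2.0], "b": [np.nan, np.nan, np.nan]})
values = first_non_null(DF)
```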

edited Apr 26 '14 at 11:23

answered Apr 26 '14 at 10:28

Woody Pride

This will work, but I was hoping there is an easier method. If I use `df.dropna()`, it drops all rows with at least one NaN in them. I can do it per-series on each of the columns, but I was hoping there is a simpler way. – yevgeny.bezman Apr 26 '14 at 10:45
Hahaha, you don't like the answer. OK. well see the updated answer!!! – Woody Pride Apr 26 '14 at 10:52
In my solution it does work per column. Each column is a pandas Series, and so first_valid_index is doing exactly what you want... It is looking in each column of the dataframe and finding the first non null index point and giving you the value at that point. I believe your edit to the question is incorrect in what it says... – Woody Pride Apr 26 '14 at 11:18
Yes, your solution works per column, on a different series each time. first_valid_index works for series the way I'd expect. calling first_valid_index on a DataFrame works as I described in my edit. – yevgeny.bezman Apr 26 '14 at 18:03
So in fact it does exactly what you wanted? – Woody Pride Apr 27 '14 at 01:08