
I have a Pandas DataFrame indexed by date. There are a number of columns, but many columns are only populated for part of the time series. I'd like to find where the first and last non-NaN values are located so that I can extract the dates and see how long the time series is for a particular column.

Could somebody point me in the right direction as to how I could go about doing something like this?

    [first_valid_index](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.first_valid_index.html#pandas.Series.first_valid_index) and [last_valid_index](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.last_valid_index.html#pandas.Series.last_valid_index) – behzad.nouri Mar 14 '14 at 11:16
  • Is there a solution for this when the missing values can be "0"? (i.e find the first non zero value, per group/time series)? – GrimSqueaker Jul 15 '17 at 12:57

3 Answers


@behzad.nouri's solution worked perfectly to return the first and last non-NaN values using Series.first_valid_index and Series.last_valid_index, respectively.
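
For example, with a hypothetical date-indexed Series (made-up data, not from the question): both methods return index labels, so with a DatetimeIndex you get Timestamps directly and can subtract them to measure the length of the populated span.

import pandas as pd
import numpy as np

# hypothetical column that is only populated in the middle of the series
s = pd.Series([np.nan, np.nan, 1.0, 2.0, 3.0, np.nan],
              index=pd.date_range('2014-01-01', periods=6))

start = s.first_valid_index()  # Timestamp('2014-01-03 00:00:00')
end = s.last_valid_index()     # Timestamp('2014-01-05 00:00:00')
end - start                    # Timedelta('2 days 00:00:00')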

    @KorayTugay A pandas Series is 1D (i.e. single column). If you want to check more columns in a `df` you can iterate over your df. Eg. `for col_name, data in df.items(): print("First valid index for column {} is at {}".format(col_name, data.first_valid_index()))` – Jason Sep 17 '18 at 12:05
    Instead of iterating over the columns of a DataFrame, you can use `df.apply(Series.first_valid_index)`. – Peque Nov 02 '20 at 19:29

Here are some helpful examples.

Series

import pandas as pd
import numpy as np

s = pd.Series([np.nan, 1, np.nan, 3, np.nan], index=list('abcde'))
s

a    NaN
b    1.0
c    NaN
d    3.0
e    NaN
dtype: float64

# first valid index
s.first_valid_index()
# 'b'

# first valid position
s.index.get_loc(s.first_valid_index())
# 1

# last valid index
s.last_valid_index()
# 'd'

# last valid position
s.index.get_loc(s.last_valid_index())
# 3

Alternative solution using notna and idxmax:

# first valid index
s.notna().idxmax()
# 'b'

# last valid index
s.notna()[::-1].idxmax()
# 'd'

DataFrame

df = pd.DataFrame({
    'A': [np.nan, 1, np.nan, 3, np.nan],
    'B': [1, np.nan, np.nan, np.nan, np.nan]
})
df

     A    B
0  NaN  1.0
1  1.0  NaN
2  NaN  NaN
3  3.0  NaN
4  NaN  NaN

(first|last)_valid_index is also defined on DataFrames, but there it considers the frame as a whole (the first/last row containing any non-NaN value). To get results per column, apply the Series methods to each column using apply.

# first valid index for each column
df.apply(pd.Series.first_valid_index)

A    1
B    0
dtype: int64

# last valid index for each column
df.apply(pd.Series.last_valid_index)

A    3
B    0
dtype: int64

As before, you can also use notna and idxmax. This is slightly more natural syntax.

# first valid index
df.notna().idxmax()

A    1
B    0
dtype: int64

# last valid index
df.notna()[::-1].idxmax()

A    3
B    0
dtype: int64
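
One caveat with the idxmax approach (also raised in the comments below): on an all-NaN column, notna() is False everywhere, so idxmax silently returns the first label rather than signalling that nothing is valid. A small sketch with made-up data:

df2 = pd.DataFrame({'A': [np.nan, 1], 'B': [np.nan, np.nan]})

df2.notna().idxmax()
# A    1
# B    0    <- misleading: B contains no valid values at all
# dtype: int64

df2.apply(pd.Series.first_valid_index)
# A    1.0
# B    NaN   <- NaN correctly flags the all-NaN column
# dtype: float64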
    The issue with `idxmax()` is that it will return `0` for a full-`NaN` column. I would expect a `NaN` in that case, so I would rather always use `.apply(Series.first_valid_index)`. – Peque Nov 02 '20 at 19:32
    For the whole dataframe you can find the first index that has no NaNs with `df.apply(pd.Series.first_valid_index).max()` – pseudoabdul Aug 18 '22 at 08:51

A convenience function based on behzad.nouri's comment and cs95's earlier answer. Any errors or misunderstandings are mine.

import pandas as pd
import numpy as np

df = pd.DataFrame([["2022-01-01", np.nan, np.nan, 1],
                   ["2022-01-02", 2, np.nan, 2],
                   ["2022-01-03", 3, 3, 3],
                   ["2022-01-04", 4, 4, 4],
                   ["2022-01-05", np.nan, 5, 5]],
                  columns=['date', 'A', 'B', 'C'])
df['date'] = pd.to_datetime(df['date'])

df
#        date    A    B  C
#0 2022-01-01  NaN  NaN  1
#1 2022-01-02  2.0  NaN  2
#2 2022-01-03  3.0  3.0  3
#3 2022-01-04  4.0  4.0  4
#4 2022-01-05  NaN  5.0  5

We want to start at the earliest date common to A and B and end at the latest date common to A and B (for whatever reason, we do not filter by column C).

# filter data to the minimum/maximum common available dates
def get_date_range(df, cols):
    """Return the earliest and latest dates for which all columns in `cols` have valid data."""
    a = df[cols].apply(pd.Series.first_valid_index).max()
    b = df[cols].apply(pd.Series.last_valid_index).min()
    return (df.loc[a, 'date'], df.loc[b, 'date'])

a, b = get_date_range(df, cols=['A', 'B'])
a
#Timestamp('2022-01-03 00:00:00')
b
#Timestamp('2022-01-04 00:00:00')

Now filter the data:

df.loc[(df.date >= a) & (df.date <= b)]
#        date    A    B    C
#2 2022-01-03  3.0  3.0  3
#3 2022-01-04  4.0  4.0  4
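
If the dates are the DataFrame's index rather than a column, as in the original question, a variation on the same idea works without the boolean mask (a sketch, reusing the df above): slicing a sorted DatetimeIndex with .loc includes both endpoints.

df2 = df.set_index('date')

# per-column first/last valid labels are now Timestamps
start = df2[['A', 'B']].apply(pd.Series.first_valid_index).max()
end = df2[['A', 'B']].apply(pd.Series.last_valid_index).min()

df2.loc[start:end]
#              A    B  C
#date
#2022-01-03  3.0  3.0  3
#2022-01-04  4.0  4.0  4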