
I have a dataframe that looks like the following:

    s1        s2       s3       s4
0   v1        v2       v3       v4
0   v5        v6       v7       np.nan
0   v8      np.nan     v9       np.nan
0   v10     np.nan     np.nan   np.nan

Essentially, reading each column from the top down, the values are numerical until, at some column-specific index, they switch to np.nan for the remainder of the column.

I've used .apply(pd.Series.last_valid_index) to get, for each column, the index of the last numerical value. However, I'm not sure of the most efficient way to retrieve a series containing the actual value at each column's last valid index.
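A minimal reproducible setup (using hypothetical numeric stand-ins for v1–v10 and a plain RangeIndex) would be:

```python
import numpy as np
import pandas as pd

# hypothetical numeric stand-ins for v1..v10
df = pd.DataFrame({
    's1': [1.0, 5.0, 8.0, 10.0],
    's2': [2.0, 6.0, np.nan, np.nan],
    's3': [3.0, 7.0, 9.0, np.nan],
    's4': [4.0, np.nan, np.nan, np.nan],
})

# index of the last non-NaN value per column
last_idx = df.apply(pd.Series.last_valid_index)
print(last_idx)
```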

Ideally I'd be able to derive a series that looks like:

   value
s1 v10
s2 v6
s3 v9
s4 v4

or as a dataframe that looks like

   s1 s2 s3 s4
0 v10 v6 v9 v4

Many thanks!

marc_s
wingsoficarus116

4 Answers


Here is another way to do it, without resetting the index:

df.apply(lambda x: x[x.notnull()].values[-1])

s1    v10
s2     v6
s3     v9
s4     v4
sacuL
    Nice solution +1, would you know if `last_valid_index` is more or less efficient than Boolean indexing via `pd.Series.notnull` ? – jpp Jun 14 '18 at 16:00
  • Thanks, Not sure, but it should be easy to test... give me a second – sacuL Jun 14 '18 at 16:01
    Based on a super crude test on a large df, I think that `last_valid_index` is actually quite a bit faster than `notnull()` – sacuL Jun 14 '18 at 16:09
  • Thanks for testing. Yeh I thought so, maybe `last_valid_index` goes backwards like a generator rather than testing each and every item. – jpp Jun 14 '18 at 16:10
  • `x[x.notnull()]` can be replaced with `x.dropna()`. – EliadL Jul 08 '21 at 09:46
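A quick sanity check (on a hypothetical df with numeric stand-ins for v1–v10) that the boolean-indexing, dropna, and last_valid_index approaches discussed above all agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    's1': [1.0, 5.0, 8.0, 10.0],
    's2': [2.0, 6.0, np.nan, np.nan],
    's3': [3.0, 7.0, 9.0, np.nan],
    's4': [4.0, np.nan, np.nan, np.nan],
})

# boolean-indexing approach from this answer
via_notnull = df.apply(lambda x: x[x.notnull()].values[-1])

# dropna variant suggested in the comments
via_dropna = df.apply(lambda x: x.dropna().values[-1])

# last_valid_index approach
via_lvi = df.apply(lambda x: x[x.last_valid_index()])

print(via_notnull)
```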

This is one way using NumPy indexing:

import numpy as np
import pandas as pd

# ensure index is normalised
df = df.reset_index(drop=True)

# calculate last valid index across dataframe
idx = df.apply(pd.Series.last_valid_index)

# create result using NumPy indexing
res = pd.Series(df.values[idx, np.arange(df.shape[1])],
                index=df.columns,
                name='value')

print(res)

s1    v10
s2     v6
s3     v9
s4     v4
Name: value, dtype: object
jpp

You need to normalize the index, find the last valid index per column, and create a dataframe from it.

import pandas as pd

df = df.reset_index(drop=True)
ser = df.apply(lambda x: x.last_valid_index())
pd.DataFrame([df[col][ser[col]] for col in df.columns], index=df.columns).T

Output:

     s1 s2  s3  s4
0   v10 v6  v9  v4

Also, this way, your original df stays intact.

harvpan
  • Thanks for the quick reply! I tried this but I get the following error: TypeError: cannot do label indexing on with these indexers [nan] of – wingsoficarus116 Jun 14 '18 at 15:51
  • Are you sure, it is coming from my code? Which line? What version are you running for pandas? – harvpan Jun 14 '18 at 15:53
  • It seems the be the second line: Traceback (most recent call last): File "", line 26, in print(pd.DataFrame([df.reset_index()[col][ser[col]] for col in df.columns], index=df.columns).T) TypeError: cannot do label indexing on with these indexers [nan] of – wingsoficarus116 Jun 14 '18 at 15:56
  • `df.reset_index()[col].iloc[ser[col]]`, does that work for you? If so, I would update my answer. I am using '0.23.0' version for pandas. It works for me. Which version are you using? – harvpan Jun 14 '18 at 16:01
  • I figured it out, I had to drop columns where all the values were np.nan - then it works like a charm! Thank you very much! – wingsoficarus116 Jun 14 '18 at 16:04
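As the comment thread works out, last_valid_index returns None for an all-NaN column, which is what triggers the label-indexing error. A hedged sketch of a guard (using .iloc as suggested in the comments and skipping columns with no valid value; the df below is hypothetical, with an extra all-NaN column s5 like the one the asker hit):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    's1': [1.0, 5.0, 8.0, 10.0],
    's2': [2.0, 6.0, np.nan, np.nan],
    's5': [np.nan] * 4,  # an all-NaN column, as in the asker's case
})

df = df.reset_index(drop=True)
# last_valid_index() is None for all-NaN columns; apply stores it as NaN
ser = df.apply(lambda x: x.last_valid_index())

# keep only columns with at least one valid value, then index positionally
valid = ser.dropna()
res = pd.DataFrame([df[col].iloc[int(valid[col])] for col in valid.index],
                   index=valid.index).T
print(res)
```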

Here is a way using stack() and groupby():

df.stack().groupby(level=1).last()

Output:

s1    v10
s2     v6
s3     v9
s4     v4

and as a df:

df.stack().groupby(level=1).last().to_frame().T

Output:

    s1  s2  s3  s4
0  v10  v6  v9  v4
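This works because stack() drops NaN entries by default (dropna=True), so each level-1 group holds only that column's valid values and .last() picks the final one. A quick check on a hypothetical df with numeric stand-ins for v1–v10:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    's1': [1.0, 5.0, 8.0, 10.0],
    's2': [2.0, 6.0, np.nan, np.nan],
    's3': [3.0, 7.0, 9.0, np.nan],
    's4': [4.0, np.nan, np.nan, np.nan],
})

stacked = df.stack()              # NaNs are dropped here
res = stacked.groupby(level=1).last()  # level 1 holds the column names
print(res)
```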
rhug123