1

I have mydf below, which I have sorted on a dummy time column and the id:

mydf = pd.DataFrame(
    {
        'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'time': [1, 4, 3, 5, 2, 6, 7],
        'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    }
).sort_values(['id', 'time'], ascending=False)

mydf
    id  time    val
5   C   6       f
3   C   5       d
1   B   4       b
2   B   3       c
6   A   7       g
4   A   2       e
0   A   1       a

I want to add a column (last_val) which, for each unique id, holds the latest val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:

mydf
    id  time    val last_val
5   C   6       f   d
1   B   4       b   c
6   A   7       g   e
4   A   2       e   a 

Any ideas?

CHRD
  • 1,917
  • 1
  • 15
  • 38

1 Answers1

2

Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already in question) and then remove rows with missing values by DataFrame.dropna:

mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])

Similar solution, only removed last duplicated rows by id column:

mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]

print (mydf)
  id  time val last_val
5  C     6   f        d
1  B     4   b        c
6  A     7   g        e
4  A     2   e        a
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252