0

I want to fill NAs using the most recent data and have it apply to groups.

This is my dataframe

pet    id     weight (lbs)
dog    1      30
dog    2      23
dog    3      NaN
cat    4      10
dog    5      NaN
cat    6      NaN
dog    7      39
cat    8      18
hippo  9      138

This is what I want the output to be

pet    id     weight (lbs)
dog    1      30
dog    2      23
dog    3      23
cat    4      10
dog    5      23
cat    6      10
dog    7      39
cat    8      18
hippo  9      138

This is the code for reproducing the dataframe-

df = pd.DataFrame({'pets':['dog', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'hippo'],
                   'id':[1, 2, 3, 4, 5, 6, 7, 8, 8],
                   'Weight':[30, 23, np.nan, 10, np.nan, np.nan, 39, 10, 138]})

In other words, I want to fill in NaNs with the most recent non-null value grouped by pet and order by id. This is the code I tried to use: dataframe.sort_values(by = 'id').groupby('pet').fillna(method = 'ffill')

ansev
  • 30,322
  • 5
  • 17
  • 31
Cauder
  • 2,157
  • 4
  • 30
  • 69
  • 1
    what is the problem? –  Feb 28 '22 at 21:40
  • I ran the code I mentioned and it looks like it worked fine. But, now my 'pet' column is gone. In other words, it filled in the data the right way, but it looks like it deleted the column I grouped on – Cauder Feb 28 '22 at 21:47

1 Answers1

0

The problem with the code you have is, pandas GroupBy Object doesn't have sort_values method.

So,first group the dataframe by pet, then apply a function which does forward-fill on each group by sorting the dataframe on id.

The reamaining part of sort_index, and reset_index is there just to get the resulting dataframe in the initial order of pet, and id columns.

out = (df.groupby(['pet'])
        .apply(lambda x: x.sort_values('id').ffill())
        .sort_index(level=1)
        .reset_index(0, drop=True)
         )

OUTPUT:

>>> out
     pet  id  weight (lbs)
0    dog   1          30.0
1    dog   2          23.0
2    dog   3          23.0
3    cat   4          10.0
4    dog   5          23.0
5    cat   6          10.0
6    dog   7          39.0
7    cat   8          18.0
8  hippo   9         138.0
ThePyGuy
  • 17,779
  • 5
  • 18
  • 45
  • Thanks, so what does reset index do? Is that making sure that the grouping column is still there in the output – Cauder Feb 28 '22 at 21:53
  • You can try running the code step by step to see how the result appears after each step. first step is grouping, after that `apply`, after that `sort_index`, and after that `rest_index` – ThePyGuy Feb 28 '22 at 21:54
  • WHY NOT `df.groupby('pet').ffill()`? – ansev Feb 28 '22 at 21:56
  • @ansev, OP wants FFILL based on the value of last maximum id – ThePyGuy Feb 28 '22 at 22:01
  • what is the problem? if we first order the entire dataframe by id when creating the groups they will always be ordered... ordering them for each group is inefficient, check my solution @ThePyGuy – ansev Feb 28 '22 at 22:02
  • @Cauder you only need keep all columns, not apply – ansev Feb 28 '22 at 22:05
  • I don't understand what you mean by keep all columns, not apply – Cauder Mar 02 '22 at 15:00
  • the problem is if you use `df.sort_values('id').groupby('pet').ffill()` pet columns is deleted, but you don't need sort by group and you can keep the original order using`assign` or `.join`.Here assign is more elegant than join' @Cauder – ansev Mar 02 '22 at 15:47
  • We can first sort the dataframe and then We can use GroupBy.ffill . In this way we avoid to use apply, see https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code @Cauder – ansev Mar 02 '22 at 15:54