filling last known data with pandas

Question

I work with dataframes with this kind of structure:

[timestamp][id][A][B][C][D]

due to a number of things, not all data is retrieved and i end up with something like

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     NaN     NaN     NaN    2623

in this case, using fillna(method="ffill") will use the 'A' value from id=4, to fill the 'A' value of id=5.

What i want to accomplish however, is to fill the 'A' value of id=5 from its previously known value somewhere along the (700000) csv entries. (and fill all NaN values from all columns this way)

2014-07-24 17:49:00   5   1046.0   -3.0   -239.0   2800.0
...
2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     NaN     NaN     NaN    2623

becomes

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     1046     -3.0     -239.0    2623

with values coming from the last known data of id=5 which is 2014-07-24 17:49:00

instead of

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     1022     0060     -0076    2623

which is what happens when i do the fillna

my second solution was to do a groupby 'id' and then do a fillna(). but that was way over my skill as a pandas user. so I was just wondering if there are any better solutions out there.

"What i want to accomplish however, is to fill the 'A' value of id=5 from its previously known value" isn't it's previously known value the A value from id=4? Could you post the expected output? — Nadine, May 26 '15 at 07:25
@highonprogramming edited to reflect desired results vs results of attempts — carlo, May 26 '15 at 07:49
@LiamFoley not quite sure what you mean by that, the dataframe is indexed by the timestamp. — carlo, May 26 '15 at 07:49
@EdChum If there were any nan values in the middle of the dataframe, that didn't have a previous value, they would be filled with the value from the previous line (which would belong to a different id). — Nadine, May 26 '15 at 10:33

Nadine · Accepted Answer · 2015-06-01T08:20:25.653

First I group by the id, and then within that group, I fill the nans with the previous value.

df = pd.DataFrame([['2014-07-24 17:49:00', 5, 1046.0, -3, -239, 2800],
                   ['2015-05-05 15:00:00', 2, np.nan, np.nan,np.nan, 2680],
                   ['2015-05-05 15:00:00', 3,  989, 20, -11, 2680], 
                   ['2015-05-05 15:00:00', 4, 1022, 60, -76, 2600], 
                   ['2015-05-05 15:00:00', 5, np.nan, np.nan, np.nan, 2623]], 
                  columns='timestamp id A B C D'.split())

df.groupby('id').apply(lambda group: group.fillna(method='ffill'))

Outcome:

             timestamp  id     A   B    C     D
0  2014-07-24 17:49:00   5  1046  -3 -239  2800
1  2015-05-05 15:00:00   2   NaN NaN  NaN  2680
2  2015-05-05 15:00:00   3   989  20  -11  2680
3  2015-05-05 15:00:00   4  1022  60  -76  2600
4  2015-05-05 15:00:00   5  1046  -3 -239  2623

Edit: This was a suggested edit by JoeCondron, that's probably a better answer than mine. I'm going to add it here until he adds it as an answer, because I don't want it to be lost:

df.groupby('id', as_index=False).fillna(method='ffill')

so that's how you do it. quick question, what's the `lambda` for? and to anyone who's gonna stumble into this question in the future, do this before assigning the timestamp as a datetimeindex, i got a keyerror from doing that. anyway, thanks again. marking this as answer. — carlo, May 27 '15 at 08:27
The lambda is used to create an anonymous function (a function without a name). Alternatively, you could pass a named function to the apply() function. The function I created has a parameter called "group". This parameter will be filled by each set of related records, and the fillna function will be applied to each group separately. Does that answer your question? — Nadine, Jun 01 '15 at 08:32

filling last known data with pandas

1 Answers1

Linked