1

I work with dataframes with this kind of structure:

[timestamp][id][A][B][C][D]

due to a number of things, not all data is retrieved and i end up with something like

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     NaN     NaN     NaN    2623 

in this case, using fillna(method="ffill") will use the 'A' value from id=4, to fill the 'A' value of id=5.

What i want to accomplish however, is to fill the 'A' value of id=5 from its previously known value somewhere along the (700000) csv entries. (and fill all NaN values from all columns this way)

2014-07-24 17:49:00   5   1046.0   -3.0   -239.0   2800.0
...
2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     NaN     NaN     NaN    2623 

becomes

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     1046     -3.0     -239.0    2623 

with values coming from the last known data of id=5 which is 2014-07-24 17:49:00

instead of

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     1022     0060     -0076    2623 

which is what happens when i do the fillna

my second solution was to do a groupby 'id' and then do a fillna(). but that was way over my skill as a pandas user. so I was just wondering if there are any better solutions out there.

carlo
  • 193
  • 1
  • 2
  • 14
  • Why not just sort by Timestamp the do 'ffill'? – Liam Foley May 26 '15 at 07:24
  • "What i want to accomplish however, is to fill the 'A' value of id=5 from its previously known value" isn't it's previously known value the A value from id=4? Could you post the expected output? – Nadine May 26 '15 at 07:25
  • @highonprogramming edited to reflect desired results vs results of attempts – carlo May 26 '15 at 07:49
  • @LiamFoley not quite sure what you mean by that, the dataframe is indexed by the timestamp. – carlo May 26 '15 at 07:49
  • Could you not sort by the timstamp and column A and ffill? – EdChum May 26 '15 at 07:53
  • @EdChum If there were any nan values in the middle of the dataframe, that didn't have a previous value, they would be filled with the value from the previous line (which would belong to a different id). – Nadine May 26 '15 at 10:33

1 Answers1

1

First I group by the id, and then within that group, I fill the nans with the previous value.

df = pd.DataFrame([['2014-07-24 17:49:00', 5, 1046.0, -3, -239, 2800],
                   ['2015-05-05 15:00:00', 2, np.nan, np.nan,np.nan, 2680],
                   ['2015-05-05 15:00:00', 3,  989, 20, -11, 2680], 
                   ['2015-05-05 15:00:00', 4, 1022, 60, -76, 2600], 
                   ['2015-05-05 15:00:00', 5, np.nan, np.nan, np.nan, 2623]], 
                  columns='timestamp id A B C D'.split())

df.groupby('id').apply(lambda group: group.fillna(method='ffill'))

Outcome:

             timestamp  id     A   B    C     D
0  2014-07-24 17:49:00   5  1046  -3 -239  2800
1  2015-05-05 15:00:00   2   NaN NaN  NaN  2680
2  2015-05-05 15:00:00   3   989  20  -11  2680
3  2015-05-05 15:00:00   4  1022  60  -76  2600
4  2015-05-05 15:00:00   5  1046  -3 -239  2623

Edit: This was a suggested edit by JoeCondron, that's probably a better answer than mine. I'm going to add it here until he adds it as an answer, because I don't want it to be lost:

df.groupby('id', as_index=False).fillna(method='ffill')
Nadine
  • 1,620
  • 2
  • 15
  • 27
  • so that's how you do it. quick question, what's the `lambda` for? and to anyone who's gonna stumble into this question in the future, do this before assigning the timestamp as a datetimeindex, i got a keyerror from doing that. anyway, thanks again. marking this as answer. – carlo May 27 '15 at 08:27
  • The lambda is used to create an anonymous function (a function without a name). Alternatively, you could pass a named function to the apply() function. The function I created has a parameter called "group". This parameter will be filled by each set of related records, and the fillna function will be applied to each group separately. Does that answer your question? – Nadine Jun 01 '15 at 08:32