
If I have a df similar to this one:

print(df)
                       A  B  C    D    E
 DATE_TIME                               
2016-08-10 13:57:00  3.6  A  1  NaN  NaN
2016-08-10 13:58:00  4.7  A  1  4.5  NaN
2016-08-10 13:59:00  3.4  A  0  NaN  5.7
2016-08-10 14:00:00  3.5  A  0  NaN  NaN
2016-08-10 14:01:00  2.6  A  0  4.6  NaN
2016-08-10 14:02:00  4.8  A  0  NaN  4.3
2016-08-10 14:03:00  5.7  A  1  NaN  NaN
2016-08-10 14:04:00  5.5  A  1  5.7  NaN
2016-08-10 14:05:00  5.6  A  1  NaN  NaN
2016-08-10 14:06:00  7.8  A  1  NaN  5.2
2016-08-10 14:07:00  8.9  A  0  NaN  NaN
2016-08-10 14:08:00  3.6  A  0  NaN  NaN

print (df.dtypes)
A    float64
B     object
C      int64
D    float64
E    float64
dtype: object

Thanks to a lot of input from the community, I now have this code, which allows me to upsample my df to one-second intervals, applying different methods to different dtypes:

int_cols = df.select_dtypes(['int64']).columns
index = pd.date_range(df.index[0], df.index[-1], freq="s")
df2 = df.reindex(index)

for col in df2:
    if col in int_cols:
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].interpolate(inplace=True)
    else:
        df2[col].ffill(inplace=True)

I am now looking for a way to interpolate only between my actual measurements. The interpolate function extends my last measurement to the end of the df:

 df2.tail()
Out[75]: 
                            A  B  C    D    E
2016-08-10 14:07:56  3.953333  A  0  5.7  5.2
2016-08-10 14:07:57  3.865000  A  0  5.7  5.2
2016-08-10 14:07:58  3.776667  A  0  5.7  5.2
2016-08-10 14:07:59  3.688333  A  0  5.7  5.2
2016-08-10 14:08:00  3.600000  A  0  5.7  5.2

But I would like to stop this where the last measurement took place (for example at 14:04:00 for col 'D' and 14:06:00 for col 'E') and leave the NaNs.

I tried setting limit=0 and limit_direction='both':

for col in df2:
    if col in int_cols:
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].interpolate(inplace=True, limit=0, limit_direction='both')
    else:
        df2[col].ffill(inplace=True)

but this didn't change the output at all. I then tried to incorporate the solution I found to this question: Pandas: interpolation where first and last data point in column is NaN into my code:

for col in df2:
    if col in int_cols:
        df2[col].ffill(inplace=True)
        df2[col] = df2[col].astype(int)
    elif df2[col].dtype == float:
        df2[col].loc[df2[col].first_valid_index():df2[col].last_valid_index()] = \
            df2[col].loc[df2[col].first_valid_index():df2[col].last_valid_index()].astype(float).interpolate(inplace=True)
    else:
        df2[col].ffill(inplace=True)

...but that did not work, and my float64 columns are now purely NaNs. Also, the way I inserted the code, I know it would only have affected the float columns. Ideally, I would like to apply this first_valid_index():last_valid_index() selection to the object and int64 columns as well. Can somebody help me? Thank you.

vera

3 Answers


Since pandas 0.23.0 it is possible to use the parameter limit_area in interpolate:

df = pd.DataFrame({'A': [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
                   'B': [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
                  columns=['A', 'B'], 
                  index=pd.date_range(start='2016-08-10 13:50:00', periods=7, freq='S'))
print (df)
                       A    B
2016-08-10 13:50:00  NaN  NaN
2016-08-10 13:50:01  1.0  NaN
2016-08-10 13:50:02  NaN  0.0
2016-08-10 13:50:03  NaN  NaN
2016-08-10 13:50:04  4.0  NaN
2016-08-10 13:50:05  NaN  2.0
2016-08-10 13:50:06  NaN  NaN

df = df.interpolate(limit_direction='both', limit_area='inside')
print (df)
                       A         B
2016-08-10 13:50:00  NaN       NaN
2016-08-10 13:50:01  1.0       NaN
2016-08-10 13:50:02  2.0  0.000000
2016-08-10 13:50:03  3.0  0.666667
2016-08-10 13:50:04  4.0  1.333333
2016-08-10 13:50:05  NaN  2.000000
2016-08-10 13:50:06  NaN       NaN
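Applied to the upsampled df2 from the question, the same parameter removes the need for any per-column slicing; a minimal sketch, assuming df2 and int_cols are built as in the question (the small df here is only illustrative):

```python
import numpy as np
import pandas as pd

# Small illustrative frame; df2/int_cols mirror the question's setup.
df = pd.DataFrame(
    {'A': [3.6, 4.7, 3.4], 'B': ['A', 'A', 'A'], 'C': [1, 1, 0],
     'D': [np.nan, 4.5, np.nan]},
    index=pd.date_range('2016-08-10 13:57:00', periods=3, freq='min'))
int_cols = df.select_dtypes(['int64']).columns
index = pd.date_range(df.index[0], df.index[-1], freq='s')
df2 = df.reindex(index)

for col in df2:
    if col in int_cols:
        df2[col] = df2[col].ffill().astype(int)
    elif df2[col].dtype == float:
        # fill only between each column's first and last valid values
        df2[col] = df2[col].interpolate(limit_direction='both',
                                        limit_area='inside')
    else:
        df2[col] = df2[col].ffill()
```

Leading and trailing NaNs stay NaN: column D has a single measurement at 13:58:00, so every other second of D remains NaN after interpolation.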
jezrael

You were very close! Here's an example, very similar to the code toward the end of your post, that gets the point across:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
                   'B': [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
                  columns=['A', 'B'], 
                  index=pd.date_range(start='2016-08-10 13:50:00', periods=7, freq='S'))
print(df)

A_first = df['A'].first_valid_index()
A_last = df['A'].last_valid_index()
df.loc[A_first:A_last, 'A'] = df.loc[A_first:A_last, 'A'].interpolate()

B_first = df['B'].first_valid_index()
B_last = df['B'].last_valid_index()
df.loc[B_first:B_last, 'B'] = df.loc[B_first:B_last, 'B'].interpolate()

print(df)

result:

                       A    B
2016-08-10 13:50:00  NaN  NaN
2016-08-10 13:50:01  1.0  NaN
2016-08-10 13:50:02  NaN  0.0
2016-08-10 13:50:03  NaN  NaN
2016-08-10 13:50:04  4.0  NaN
2016-08-10 13:50:05  NaN  2.0
2016-08-10 13:50:06  NaN  NaN

                       A         B
2016-08-10 13:50:00  NaN       NaN
2016-08-10 13:50:01  1.0       NaN
2016-08-10 13:50:02  2.0  0.000000
2016-08-10 13:50:03  3.0  0.666667
2016-08-10 13:50:04  4.0  1.333333
2016-08-10 13:50:05  NaN  2.000000
2016-08-10 13:50:06  NaN       NaN

The two problems in your code were:

  1. If you are going to do df[...] = df[...].interpolate(), you need to remove inplace=True since that will make it return None. That is your main problem and why you were getting all NaNs.
  2. Though it seems to work here, in general, chained indexing is bad:

You want:

df.loc[A_first:A_last, 'A'] = df.loc[A_first:A_last, 'A'].interpolate()

Not:

df['A'].loc[A_first:A_last] = df['A'].loc[A_first:A_last].interpolate()

See here for more detail: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
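Putting both fixes together for every column of the example frame, the per-column valid ranges can be handled in a loop; a sketch using proper .loc indexing and no inplace=True:

```python
import numpy as np
import pandas as pd

# Interpolate each column only between its own first and last valid values,
# assigning back through .loc rather than chained indexing.
df = pd.DataFrame(
    {'A': [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan, np.nan],
     'B': [np.nan, np.nan, 0.0, np.nan, np.nan, 2.0, np.nan]},
    index=pd.date_range('2016-08-10 13:50:00', periods=7, freq='s'))

for col in df:
    first = df[col].first_valid_index()
    last = df[col].last_valid_index()
    df.loc[first:last, col] = df.loc[first:last, col].interpolate()
```

The same loop works for object and int64 columns if .interpolate() is swapped for .ffill() on those dtypes, since the .loc slice already confines the fill to each column's measured range.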

MarredCheese
  • Thank you so much for the explanations and the helpful link. I tried following line `df.loc[df['A'].first_valid_index():df['A'].last_valid_index()] = df.loc[df['A'].first_valid_index():df['A'].last_valid_index()].interpolate()` with your `df` to see what it would do and I was surprised that also column B changed. Why would that happen? I also changed the line from my code: `df2.loc[df2[col].first_valid_index():df2[col].last_valid_index()] = df2.loc[df[col].first_valid_index():df2[col].last_valid_index()].interpolate()`, which still wont work. Sorry for asking again, I am really trying to learn – vera Aug 31 '16 at 06:03
  • 1
    It's changing column B too because you didn't specify a column, so it's interpolating every column. Generically, what you need is `df.loc[, ] = df.loc[, ]`. Specifically, what you need is `df.loc[df['A'].first_valid_index():df['A'].last_valid_index(‌​), 'A'] = df.loc[df['A'].first_valid_index():df['A'].last_valid_index(‌​), 'A'].interpolate()` and `df2.loc[df2[col].first_valid_index():df2[col].last_valid_ind‌​ex(), col] = df2.loc[df[col].first_valid_index():df2[col].last_valid_inde‌​x(), col].interpolate()` – MarredCheese Aug 31 '16 at 18:38
  • Thank you! It is working now! I am bewildered though, that when I run this line (where I modified the one in the original answer) `df.loc[df['A'].first_valid_index():df['A'].last_valid_index(), 'A'] = df.loc[df['A'].first_valid_index():df['A'].last_valid_index(), 'A'].interpolate()` it works, but with this seemingly exact same line `df.loc[df['A'].first_valid_index():df['A'].last_valid_index(‌​‌​), 'A'] = df.loc[df['A'].first_valid_index():df['A'].last_valid_index(‌​‌​), 'A'].interpolate()` as a copy-past from your last edit I get `SyntaxError: invalid character in identifier.` Am I crazy? – vera Aug 31 '16 at 23:03
  • 1
    Weird, maybe some invisible character got in there during copying and pasting? http://stackoverflow.com/a/14844830/5405967 – MarredCheese Sep 01 '16 at 01:06
  • indeed....i copied in into notepad and it is `last_valid_index(??)` vs. `last_valid_index(??????)`....thanks a lot! – vera Sep 01 '16 at 01:17

You can backfill null values and then use boolean indexing to take the null values of each column (which must be the tail nulls).

for col in ['D', 'E']:
    idx = df[df[col].bfill().isnull()].index
    df[col].ffill(inplace=True)
    df.loc[idx, col] = None
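A minimal sketch of the idea on a single column: bfill() leaves NaN only at the trailing rows, so those positions can be recorded, the column forward-filled, and the tail reset afterwards:

```python
import numpy as np
import pandas as pd

# Forward-fill 'D' but keep its trailing NaNs as NaN.
df = pd.DataFrame({'D': [np.nan, 4.5, np.nan, 5.7, np.nan, np.nan]})

idx = df[df['D'].bfill().isnull()].index  # rows after the last valid value
df['D'] = df['D'].ffill()                 # forward-fill everything
df.loc[idx, 'D'] = None                   # restore the trailing NaNs
```

The leading NaN survives on its own, since ffill() has nothing before it to fill from; only the tail needs the bfill() bookkeeping.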
Alexander
  • thank you! I am still trying to understand how to use it! I understand what it should be doing, I think, but I'm not quite there yet. – vera Aug 31 '16 at 06:05