3

I have a df like

d = {'col1': [np.nan, np.nan, 1],
     'col2': [1, 1, 2],
     'col3': [2, 2, 3],
     'col4': [np.nan, 3, np.nan]}
df = pd.DataFrame(data=d)

and would like to extrapolate on the rows to fill any trailing nans.

Expected output:

d2 = {'col1': [np.nan, np.nan, 1],
      'col2': [1, 1, 2],
      'col3': [2, 2, 3],
      'col4': [3, 3, 4]}
df2 = pd.DataFrame(data=d2)

EDIT: The slope is different for each row. I have tried df.interpolate(method='linear') but that gives me a flat trend for trailing nans

  • Do you assume that all columns have the same underlying function for the extrapolation or do you treat each column independently? It is also unclear, why col1 is not interpolated. And what have you tried? There is for instance this: https://stackoverflow.com/a/48000949/8881141 – Mr. T May 31 '18 at 06:49
  • slope can be different for each row. See edit in question for what have been tried. The link you provided raises an exception `error: (m>k) failed for hidden m: fpcurf0:m=1` – Marcus Högenå Bohman May 31 '18 at 07:26
  • What do you expect from the interpolation? You provide one data point and it magically has to guess what kind of linear relationship you have in mind? y = 1 is also a linear relationship. And what about exploring the link, Wen provided to the documentation? interpolate is flexible and provides keywords like method and limit_direction to adapt it to your needs. – Mr. T May 31 '18 at 08:31
  • I extrapolate on rows, not columns, so I always have at least two values. The linear interpolation from `df.interpolation(method = 'linear', direction = 'forward', direction_limits = 'both')` gives a slope of 0 when extrapolating. Seems to be known bug: [link](https://github.com/pandas-dev/pandas/issues/8000)[link] – Marcus Högenå Bohman May 31 '18 at 10:41

1 Answers1

6

pandas.interpolate, which is mainly a wrapper for scipy's interpolation functions, has many keywords that allow you to adapt your interpolation. You could use spline:

d = {'col1': [np.nan, np.nan, 1, 5, 9, np.nan],
     'col2': [1, 1, 2, 5, 8, np.nan],
     'col3': [2, 2, 3, 4, 5, np.nan],
     'col4': [np.nan, 3, np.nan, 5, 6, np.nan]}
df = pd.DataFrame(data=d)

df = df.interpolate(method = "spline", order = 1, limit_direction = "both")
print(df)

Output:

   col1  col2  col3  col4
0  -7.0   1.0   2.0   2.0
1  -3.0   1.0   2.0   3.0
2   1.0   2.0   3.0   4.0
3   5.0   5.0   4.0   5.0
4   9.0   8.0   5.0   6.0
5  13.0   8.8   5.6   7.0

Edit:
There are probably more elegant solutions in pandas but here is one way to address the problem:

d = {'col1 Mar': [np.nan, np.nan, 1],
     'col2 Jun': [1, 1, 2],
     'col3 Sep': [2, 2, 3],
     'col4 Dec': [np.nan, 3, np.nan]}
df = pd.DataFrame(data=d)
print(df)
#store temporarily the column index
col_index = df.columns
#transcribe month into a number that reflects the time distance
df.columns = [3, 6, 9, 12]

#interpolate over rows
df = df.interpolate(method = "spline", order = 1,  limit_direction = "both", axis = 1, downcast = "infer")
#assign back the original index
df.columns = col_index
print(df)

Output:

   col1 Mar   col2 Jun  col3 Sep  col4 Dec
0       NaN          1         2       NaN
1       NaN          1         2       3.0
2       1.0          2         3       NaN
   col1 Mar   col2 Jun  col3 Sep  col4 Dec
0         0          1         2         3
1         0          1         2         3
2         1          2         3         4

If you provide the column index as a datetime object, you probably can use the column index directly, but I am not sure about it.

Edit 2: As expected, you can also use datetime objects as column names to interpolate:

CSV file

Mar 2014, Jun 2014, Sep 2014, Mar 2015
nan,        1,        2,      nan
nan,        1,        2,      4
1,          2,        3,      nan

Code:

#read CSV file
df = pd.read_csv("test.txt", sep = r',\s*')
#convert column names to datetime objects
df.columns = pd.to_datetime(df.columns)
#interpolate over rows
df = df.interpolate(method = "spline", order = 1,  limit_direction = "both", axis = 1, downcast = "infer")
print(df)

Output:

   2014-03-01  2014-06-01  2014-09-01  2015-03-01
0    0.000000         1.0         2.0    3.967391
1   -0.016457         1.0         2.0    4.000000
2    1.000000         2.0         3.0    4.967391

The results are now not nice, round integers anymore, because the amount of days in three months differ.

Mr. T
  • 11,960
  • 10
  • 32
  • 54
  • That is exactly what I want, but on rows instead of columns. I try to run it on my original `df`with `axis = 1` but I get an error message `TypeError: unsupported operand type(s) for -: 'str' and 'str'`. Could you include how you would do it on rows on my initial `df`? – Marcus Högenå Bohman May 31 '18 at 10:50
  • Your question is unclear. What does it mean to interpolate against the series of strings "col1, col2, col3, col4"? – Mr. T May 31 '18 at 11:27
  • Columns are months, indexes are vehicles and values are driven kilometers. I am trying to interpolate for missing values and also make a forecast of future kilometers (trailing `nan`s) – Marcus Högenå Bohman May 31 '18 at 12:32
  • You should always provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). If the months are evenly spaced or datetime objects, the approach might be simpler. Please see the update. – Mr. T May 31 '18 at 13:35
  • 1
    I have added an example with datetime objects. – Mr. T May 31 '18 at 19:00