Suppose we have a DataFrame with randomly placed NaNs (sometimes an entire row is NaN). Is there an established, vectorized way to interpolate the missing values so that both the rows and the columns influence the result at the same time?

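import numpy as np
import pandas as pd

# 100,000 rows of standard-normal values in four columns
df = pd.DataFrame(np.random.randn(100000, 4),
                  columns=['one', 'two', 'three', 'four'])

# Replace roughly 10% of the cells with NaN at random positions
df = df.mask(np.random.random(df.shape) < .1)
print(df)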

            one       two     three      four
0      0.328574  0.460837 -1.242114  0.871454
1     -1.155524  0.911798  0.733518  1.355840
2     -0.482975       NaN -0.688304  0.015186
3     -0.714028 -2.133300       NaN  1.074630
4     -0.789536 -0.330372  1.158331 -0.571878
        ...       ...       ...       ...
99995 -0.030537  0.160436 -2.085611       NaN
99996 -0.690557       NaN -2.499389  0.044560
99997  0.150332 -1.188956       NaN -1.645208
99998  1.124226  0.443667  1.543553  0.469025
99999 -2.084317 -0.056264 -0.389893 -0.743672

[100000 rows x 4 columns]
benjamin_z
  • If you're talking about linear interpolation, it's not necessary. The rows already have the "increment" from the rows above and below, so interpolating along the row with the values you do have will produce the right value. – Tim Roberts Mar 17 '22 at 19:54
  • As with a 2D image, you can use 2D interpolation. See https://stackoverflow.com/a/39596856/15239951 – Corralien Mar 17 '22 at 20:01
  • @TimRoberts could you clarify what you mean by "increment"? Do you mean something like a "context" already included in the values of the row? – benjamin_z Mar 18 '22 at 11:02
  • @Corralien is that possible with one of the methods of the pandas function df.interpolate(method='...') ([link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html)), or alternatively with scipy.interpolate.interp2d ([link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp2d.html))? – benjamin_z Mar 18 '22 at 11:02
  • If the rows are linearly related to each other, then you can get the proper replacement for a missing value EITHER by interpolating from left and right, OR by interpolating from top and bottom, because the left/right values already include the row difference. It doesn't work in your example, because 400,000 random values are not linearly related. – Tim Roberts Mar 18 '22 at 20:40
  • @TimRoberts what if they are related, but not strictly linearly, e.g. only correlated? Or, if the columns hold the start value, minimum value, maximum value and end value for each timestamp, would interpolating along the row make sense? – benjamin_z Mar 25 '22 at 21:34
  • Look, the ONLY way you can reasonably fill in a missing value is if you know how the values in the rows are related to each other. If there is no linear relationship, then you might as well just choose the average for the row, or a random number between min and max. – Tim Roberts Mar 25 '22 at 23:01
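
One way to let both directions contribute, as discussed in the comments above, is to interpolate linearly along the columns and along the rows separately and then average the two estimates. The following is only a minimal sketch under assumptions: `interpolate_rows_and_cols` is a hypothetical helper name, not a pandas built-in, and `method="linear"` treats neighbouring rows and columns as equally spaced. Whether the result is meaningful depends entirely on how the rows and columns are actually related.

```python
import numpy as np
import pandas as pd


def interpolate_rows_and_cols(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with the mean of a column-wise and a row-wise linear interpolation.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # Interpolate down each column; limit_direction="both" also fills
    # leading/trailing NaNs with the nearest valid value.
    by_col = df.interpolate(method="linear", axis=0, limit_direction="both")
    # Interpolate across each row; an all-NaN row stays NaN here.
    by_row = df.interpolate(method="linear", axis=1, limit_direction="both")

    # Average the two estimates. Where only one of them exists
    # (e.g. all-NaN rows have no row-wise estimate), fall back to that one.
    estimate = ((by_col + by_row) / 2).fillna(by_col).fillna(by_row)

    # Keep observed values untouched and fill only the gaps.
    return df.fillna(estimate)


# Small demo on data shaped like the question's example (smaller for readability).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((8, 4)),
                    columns=["one", "two", "three", "four"])
demo = demo.mask(rng.random(demo.shape) < 0.2)
demo.iloc[3] = np.nan  # force one all-NaN row
print(interpolate_rows_and_cols(demo))
```

All steps are whole-frame pandas operations, so there is no Python-level loop over rows. On data that really is linear in both directions, the row-wise and column-wise estimates coincide, which matches the point made in the comments; on unrelated random values such as the question's example, the filled numbers carry no real information.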
