I have an array with missing values in various places.

import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
print(df)

0    NaN
1    NaN
2    3.0
3    4.0
4    5.0
5    6.0
6    NaN
7    8.0
8    9.0
dtype: float64

For each NaN, I want to take the value that follows it and divide it by two, then propagate that result backwards through any consecutive NaNs (halving again each time), so I would end up with:

0    0.75
1    1.5
2    3.0
3    4.0
4    5.0
5    6.0
6    4.0
7    8.0
8    9.0
dtype: float64

I've tried df.interpolate(), but that doesn't seem to work with consecutive NaNs.

jezrael
BobbyJohnsonOG
  • even if `interpolate()` did work, it wouldn't do what you need. And by the way your "interpolation" rule seems quite weird. Are you sure that this is the way you want to do it? – Ma0 Aug 24 '16 at 11:37
  • @Ev.Kounis I'm not entirely sure this is the method I want, but right now I am just replicating what someone else has done with their data. Then I'll figure out a better way. In reality, I should be doing a curve-fitting exercise on the data to predict the missing values. – BobbyJohnsonOG Aug 24 '16 at 12:18
  • what is typically done is that you assume the missing segment to be a straight line and based on the closest available points before and after the 'NaN' you calculate a value. This is what's called linear interpolation (see https://en.wikipedia.org/wiki/Linear_interpolation) – Ma0 Aug 24 '16 at 12:23
  • @Ev.Kounis Does linear interpolation work with consecutive NaN's, or NaN's that are at the beginning or end of a series? If so, why doesn't Pandas interpolation(method='linear', axis=1, limit_direction='both') work when I tried it before? It doesn't seem to touch NaN's at the beginning or end of my series. – BobbyJohnsonOG Aug 24 '16 at 12:27
  • to do those you would have to **extra**polate. that is "guess" values that are **out**side a given range based on extending what comes next or came before. – Ma0 Aug 24 '16 at 12:37
  • @Ev.Kounis, thanks for that - found a good answer on extrapolating in Pandas here https://stackoverflow.com/questions/22491628/extrapolate-values-in-pandas-dataframe – BobbyJohnsonOG Aug 24 '16 at 12:45
  • No problem. Cheers! – Ma0 Aug 24 '16 at 12:47
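
To make the distinction drawn in the comments concrete, here is a small sketch of what linear interpolation does with the Series from the question. It only uses `Series.interpolate()` with its default settings; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3.0, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0])

# Linear interpolation draws a straight line between the valid neighbours
# on both sides of a gap, so index 6 becomes the midpoint of 6.0 and 8.0.
filled = s.interpolate()

# The leading NaNs have no valid value before them, so with the default
# settings they are left untouched: filling them would be extrapolation.
print(filled)
```

Neither behaviour matches the halving rule asked for in the question, so that rule has to be built by hand.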

2 Answers

Another solution uses bfill() and ffill(), which are the same as fillna() with method='bfill' and method='ffill':

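# reverse the Series and flag the NaNs
b = df[::-1].isnull()
# k = position of each NaN within its run of consecutive NaNs (0 for real values)
k = b.cumsum() - b.cumsum().where(~b).ffill()
# divisor is 2**k (1, 2, 4, 8, ...), so each NaN in a run is half of the one after it
a = 2 ** k

print(a)
8    1.0
7    1.0
6    2.0
5    1.0
4    1.0
3    1.0
2    1.0
1    2.0
0    4.0
dtype: float64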

print(df.bfill().div(a))
0    0.75
1    1.50
2    3.00
3    4.00
4    5.00
5    6.00
6    4.00
7    8.00
8    9.00
dtype: float64
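
The same divisor idea can also be written with an explicit per-run counter. This is only a sketch, assuming the rule is repeated halving (the k-th NaN counting back from the next valid value is divided by 2**k); `halve_backfill` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def halve_backfill(s):
    """Fill each NaN with half of the value that follows it (hypothetical helper)."""
    rev = s[::-1]                      # walk from the end towards the start
    is_nan = rev.isna()
    # a new group starts at every real value; the NaNs after it share its label
    run_id = (~is_nan).cumsum()
    # 1-based position of each NaN inside its run, 0 for real values
    depth = is_nan.astype(int).groupby(run_id).cumsum()
    # carry the real value into the gap, then divide by 2**depth
    return (rev.ffill() / 2.0 ** depth)[::-1]

s = pd.Series([np.nan, np.nan, 3.0, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0])
print(halve_backfill(s))
```

Because the divisor is a power of two, a run of three NaNs before 8.0 would come out as 1.0, 2.0, 4.0. A NaN at the very end of the Series has no following value and stays NaN.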

Timings (len(df)=9k):

In [315]: %timeit (mat(df))
100 loops, best of 3: 11.3 ms per loop

In [316]: %timeit (jez(df1))
100 loops, best of 3: 2.52 ms per loop

Code for timings:

import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
print(df)
df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()

def jez(df):
    b = df[::-1].isnull()
    # divisor 2**k, where k counts the NaNs in the current run (0 for real values)
    a = 2 ** (b.cumsum() - b.cumsum().where(~b).ffill())
    return df.bfill().div(a)

def mat(df):
    prev = 0
    new_list = []
    for i in df.values[::-1]:
        if np.isnan(i):
            new_list.append(prev/2.)
            prev = prev / 2.
        else:
            new_list.append(i)
            prev = i
    return pd.Series(new_list[::-1])

print (mat(df))
print (jez(df1))
jezrael

You can do something like this:

import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)

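prev = 0  # starting value, used only if the Series ends with NaN
new_list = []
# walk the values from last to first so each NaN can see the value after it
for i in df.values[::-1]:
    if np.isnan(i):
        # NaN: use half of the value that follows it, and remember the result
        prev = prev / 2.
        new_list.append(prev)
    else:
        new_list.append(i)
        prev = i
# restore the original order
df = pd.Series(new_list[::-1])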

It loops over the values of the Series in reverse, keeping track of the previous value. If the current value is not NaN, it is kept as is; otherwise it is replaced by half of the previous value.

This might not be the most sophisticated Pandas solution, but you can change the behavior quite easily.
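
As an illustration of how the loop can be adapted, here is a sketch that wraps it in a function. The names `fill_from_next`, `factor` and `start` are made up for this example and are not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_from_next(series, factor=0.5, start=0.0):
    """Replace each NaN with `factor` times the value that follows it."""
    prev = start  # used only if the series ends with NaN
    out = []
    for value in series.to_numpy()[::-1]:   # last to first
        if np.isnan(value):
            prev = prev * factor            # scale the following value
        else:
            prev = value                    # real value: keep it
        out.append(prev)
    return pd.Series(out[::-1], index=series.index)

s = pd.Series([np.nan, np.nan, 3.0, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0])
halved = fill_from_next(s)                # the rule from the question
thirds = fill_from_next(s, factor=1 / 3)  # same loop, different ratio
print(halved)
```

Changing the rule then only means changing the arguments, or the single line inside the `if` branch.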

Mathias711