
I've got a dataset with multiple sequences of missing values of varying lengths, and I'd like to find the first valid numbers that occur before and after these sequences for particular dates. In the sample dataset below, I would like to find the valid numbers in ColumnB that occur closest to the date 2018-11-26.

Datasample:

Date         ColumnA   ColumnB
2018-11-19   107.00      NaN
2018-11-20   104.00      NaN
2018-11-21   106.00      NaN
2018-11-22   105.24    80.00
2018-11-23   104.63      NaN
2018-11-26   104.62      NaN
2018-11-28   104.54      NaN
2018-11-29   103.91    86.88
2018-11-30   103.43      NaN
2018-12-01   106.13      NaN
2018-12-02   110.83      NaN

Expected output:

[80, 86.88]

Some details:

If this particular sequence were the only one with missing values, I could solve it with for-loops or the pandas functions first_valid_index() or isnull(), as described in Pandas - find first non-null value in column, but that will rarely be the case.

I'm able to solve this with a few for-loops, but that is very slow for larger datasets and not very elegant, so I'd really like to hear other suggestions!
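For reference, the sample above can be rebuilt as a DataFrame like this (column names and dtypes assumed from the table):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2018-11-19", "2018-11-20", "2018-11-21", "2018-11-22",
        "2018-11-23", "2018-11-26", "2018-11-28", "2018-11-29",
        "2018-11-30", "2018-12-01", "2018-12-02",
    ]),
    "ColumnA": [107.00, 104.00, 106.00, 105.24, 104.63, 104.62,
                104.54, 103.91, 103.43, 106.13, 110.83],
    # Only two valid values; everything else is missing
    "ColumnB": [np.nan, np.nan, np.nan, 80.00, np.nan, np.nan,
                np.nan, 86.88, np.nan, np.nan, np.nan],
})
```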

vestland
  • The closest to 2018-11-16 is 80.00, so why do you get a second value? Oh, I think you meant 2018-11-26? – Scott Boston Dec 11 '18 at 14:14
  • @ScottBoston You're absolutely right! I've edited the question. – vestland Dec 11 '18 at 14:16
  • If you want to find the *closest* non-null number, based on the `Date` then this is really just an interpolation problem: `df.set_index('Date').ColumnB.interpolate('nearest').ffill().bfill()`. If it's interpolation just based on index, then no need to `.set_index` – ALollz Dec 11 '18 at 14:53
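The interpolation idea from the comment above can be sketched as follows; note that `method='nearest'` requires SciPy, and the sample frame is rebuilt inline so the snippet stands alone (only the rows around the gap are needed to show the idea):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-11-22", "2018-11-23", "2018-11-26",
                            "2018-11-28", "2018-11-29"]),
    "ColumnB": [80.00, np.nan, np.nan, np.nan, 86.88],
})

# 'nearest' fills each interior NaN with the valid value closest in time
# (SciPy-backed); ffill/bfill cover any leading/trailing NaNs that
# interpolation leaves untouched
s = df.set_index("Date")["ColumnB"].interpolate("nearest").ffill().bfill()
```

For 2018-11-26 this picks 86.88, since 2018-11-29 (3 days away) is closer than 2018-11-22 (4 days away).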

5 Answers


Try it this way: get the positional index of the date, then slice and take the last valid value before it and the first valid value after it

import numpy as np

idx = np.where(df['Date'] == '2018-11-26')[0][0]
# idx 5

# label-based .loc works here because the default RangeIndex
# labels coincide with positions
num = (df.loc[df.loc[:idx, 'ColumnB'].last_valid_index(), 'ColumnB'],
       df.loc[df.loc[idx:, 'ColumnB'].first_valid_index(), 'ColumnB'])

num
(80.0, 86.88)
Bharath M Shetty

You can use ffill and bfill to create two columns holding the valid values from before and after each row, such as

df['before'] = df.ColumnB.ffill()
df['after'] = df.ColumnB.bfill()

then get the value for the dates you want with a loc

print (df.loc[df.Date == pd.to_datetime('2018-11-26'),['before','after']].values[0].tolist())
[80.0, 86.88]

and if you have a list of dates then you can use isin:

list_dates = ['2018-11-26','2018-11-28']
print (df.loc[df.Date.isin(pd.to_datetime(list_dates)),['before','after']].values.tolist())
[[80.0, 86.88], [80.0, 86.88]]
Ben.T

I'd try it this way:

import pandas as pd
import numpy as np

# assuming Date is set as the index
df_vld = df.dropna()

idx = np.argmin(abs(df_vld.index - pd.Timestamp(2018, 11, 26)))
# 1

df_vld.loc[df_vld.index[idx]]
Out:
ColumnA    103.91
ColumnB     86.88
Name: 2018-11-29 00:00:00, dtype: float64
SpghttCd
Assuming Date is set as the index:

[df['ColumnB'].ffill().loc['2018-11-26'], df['ColumnB'].bfill().loc['2018-11-26']]
Brian
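A minimal self-contained sketch of this one-liner, with the sample data rebuilt and Date as the index (the index assumption is mine; the original snippet does not show the setup):

```python
import numpy as np
import pandas as pd

# Sample data from the question, indexed by date
df = pd.DataFrame({
    "ColumnB": [np.nan, np.nan, np.nan, 80.00, np.nan, np.nan,
                np.nan, 86.88, np.nan, np.nan, np.nan],
}, index=pd.to_datetime([
    "2018-11-19", "2018-11-20", "2018-11-21", "2018-11-22",
    "2018-11-23", "2018-11-26", "2018-11-28", "2018-11-29",
    "2018-11-30", "2018-12-01", "2018-12-02",
]))

# ffill carries the last valid value forward, bfill the next one backward,
# so indexing both at the target date yields the surrounding valid values
result = [df["ColumnB"].ffill().loc["2018-11-26"],
          df["ColumnB"].bfill().loc["2018-11-26"]]
# result == [80.0, 86.88]
```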

Here's a way to do it:

t = '2018-11-26'

Look for the index of the date t:

ix = df.loc[df.Date==t].index.values[0]

Keep positions of non-null values in ColumnB:

non_nulls = np.where(~df.ColumnB.isnull())[0]

Get the nearest non-null values both above and below:

[df.loc[non_nulls[non_nulls < ix][-1],'ColumnB']] + [df.loc[non_nulls[non_nulls > ix][0],'ColumnB']]

[80.0, 86.88]
yatu