I have a pandas DataFrame with hourly values for January 2015, but some hours are missing entirely (both the index entry and the value). Ideally the DataFrame, with columns named "dates" and "values", should have 744 rows. However, 10 hours are missing at random, so it has only 734 rows. I want to interpolate the missing hours to create the desired DataFrame with 744 "dates" and 744 "values".

Edit:

I am new to Python, so I am struggling to implement this idea:

  • Create a DataFrame whose first column holds all hours in Jan 2015
  • Create a second column of the same size, filled with NaNs
  • Fill the second column with the available values, so that only the missing hours still hold NaNs
  • Use the pandas interpolate function

Edit2:

I was looking for hints or code snippets. Based on the suggestion below I was able to write the following code, but it fails to fill in the values at the start of the month, i.e. hours 1 through 5 on Jan 1, which stay at zero.

import pandas as pd
st_dt = '2015-01-01'
en_dt = '2015-01-31'
DateTimeHour = pd.date_range(pd.Timestamp(st_dt).date(), pd.Timestamp(en_dt).date(), freq='H')
Pwr.index = pd.DatetimeIndex(Pwr.index)  # Pwr is the original dataframe
Pwr = Pwr.reindex(DateTimeHour, fill_value=0)
Pwr2 = pd.Series(Pwr.values)
Pwr2.interpolate(limit_direction='both')
Zanam
  • What are you asking for exactly? A fully working solution? Hints on how to write the code yourself? The idea you have in mind seems reasonable to me. Are you stuck on something in particular? Do you want to check whether your idea is the best one? – Andrea Corbellini Dec 18 '15 at 19:43

3 Answers

3

Use df.asfreq to expand the DataFrame so as to have an hourly frequency. NaN is inserted for missing values:

df = df.asfreq('H')

then use df.interpolate to replace the NaNs with (linearly) interpolated values based on the DatetimeIndex and the nearest non-NaN values:

df = df.interpolate(method='time')

For example,

import numpy as np
import pandas as pd

N, M = 744, 734
index = pd.date_range('2015-01-01', periods=N, freq='H')
idx = np.random.choice(np.arange(N), M, replace=False)
idx.sort()
index = index[idx]

# This creates a toy DataFrame with 734 non-null rows:
df = pd.DataFrame({'values': np.random.randint(10, size=(M,))}, index=index)

# This expands the DataFrame to 744 rows (10 null rows):
df = df.asfreq('H')

# This makes `df` have 744 non-null rows:
df = df.interpolate(method='time')
unutbu
1

What you want requires a combination of this technique: Add missing dates to pandas dataframe

And the pandas function pandas.Series.interpolate. From what you've said, the option 'linear' is what you want.

EDIT:
Interpolate will not work in the case where you have data points missing at the very start of the time series. One idea is to use pandas.Series.fillna with 'backfill' after the interpolation. Also, do not set fill_value to 0 when you call reindex: the missing hours must be NaN, not 0, for interpolate to treat them as gaps.
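A minimal sketch of that sequence, assuming a toy Series in place of the asker's Pwr (the names full_index, missing and pwr are made up for illustration). It reindexes without fill_value so that the gaps are NaN, interpolates, then back-fills whatever is still missing at the start. It uses .bfill(), which should behave like fillna with 'backfill', and the lowercase 'h' frequency alias that newer pandas releases prefer over 'H'.

```python
import numpy as np
import pandas as pd

# All 744 hours of January 2015.
full_index = pd.date_range("2015-01-01", periods=744, freq="h")

# Hypothetical stand-in for Pwr: a simple ramp with 10 hours removed,
# including the first five hours of Jan 1.
missing = [0, 1, 2, 3, 4, 100, 101, 300, 500, 600]
values = pd.Series(np.arange(744, dtype=float), index=full_index)
pwr = values.drop(full_index[missing])  # 734 rows

# Reindex WITHOUT fill_value so the missing hours become NaN, not 0.
pwr = pwr.reindex(full_index)

# Interior gaps are filled linearly in time.
filled = pwr.interpolate(method="time")

# Hours before the first known value have no earlier neighbour,
# so copy the first known value backwards into them.
filled = filled.bfill()

print(len(filled), int(filled.isna().sum()))
```

Back-filling simply repeats the first known value, which is a choice rather than a true interpolation; whether that is acceptable depends on the data.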

Community
0

A general interpolation is the following:

If the key exists:

  • Return the value

else:

  • Find the nearest key before and the nearest key after the required key, compute the distance (using whatever metric you choose) to both keys, and take a weighted average of their values, weighted by the distances of the keys (a closer key gets a higher weight).
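The rule above can be sketched in plain Python. This is only an illustration under some assumptions: keys are numbers, distance is the absolute difference, and the function name interpolate_at and the dict layout are hypothetical. It does not handle keys outside the known range, since the rule needs a neighbour on both sides.

```python
from bisect import bisect_left


def interpolate_at(points, key):
    """Return the value at `key`, interpolating linearly if it is absent.

    `points` is a dict mapping numeric keys to numeric values.
    """
    # If the key exists, return its value directly.
    if key in points:
        return points[key]

    keys = sorted(points)
    pos = bisect_left(keys, key)
    if pos == 0 or pos == len(keys):
        raise KeyError("key lies outside the known range")

    # Nearest known keys on either side of the requested key.
    before, after = keys[pos - 1], keys[pos]
    d_before = key - before
    d_after = after - key

    # Each neighbour is weighted by the distance to the OTHER neighbour,
    # so the closer key contributes more.
    total = d_before + d_after
    return (points[before] * d_after + points[after] * d_before) / total


known = {0: 0.0, 10: 100.0}
print(interpolate_at(known, 2.5))
```

With hourly timestamps converted to numbers (for example hours since the start of the month), this gives the same kind of result as a linear, time-based interpolation.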
Hidde