
Here's what my data looks like:

There are daily records, except for a gap from 2017-06-12 to 2017-06-16.

df2['timestamp'] = pd.to_datetime(df2['timestamp'])
df2['timestamp'] = df2['timestamp'].map(lambda x: datetime.datetime.strftime(x, '%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric=True)
df2 = df2.groupby('timestamp', as_index=False).sum()
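
(Note: `convert_objects` was deprecated and later removed from pandas; an equivalent of the snippet above using current APIs, on a small hypothetical frame, would be:)

```python
import pandas as pd

# hypothetical frame standing in for the question's data
df2 = pd.DataFrame({'timestamp': ['2017-06-10 05:00', '2017-06-10 18:00'],
                    'temperature': ['20.5', '21.5']})

# normalize timestamps to a date string, convert strings to numbers,
# then aggregate per day -- same steps as the snippet above
df2['timestamp'] = pd.to_datetime(df2['timestamp']).dt.strftime('%Y-%m-%d')
df2['temperature'] = pd.to_numeric(df2['temperature'])
df2 = df2.groupby('timestamp', as_index=False).sum()
print(df2)  # one row: 2017-06-10, 42.0
```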

I need to fill this gap and the others with values for all fields (timestamp, temperature, humidity, light, pressure, speed, battery_voltage, and so on).

How can I accomplish this with Pandas?
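
For reference, a minimal sketch of the usual pandas idiom for this, reindexing against a complete `pd.date_range` so that every missing day appears as an all-NaN row (the frame and column names here are hypothetical):

```python
import pandas as pd

# hypothetical frame reproducing the gap (2017-06-12 .. 2017-06-16 missing)
df2 = pd.DataFrame({'timestamp': ['2017-06-10', '2017-06-11', '2017-06-17'],
                    'temperature': [20.0, 21.0, 19.0]})
df2['timestamp'] = pd.to_datetime(df2['timestamp'])

# complete daily index; reindex turns each missing day into an all-NaN row
full_range = pd.date_range(df2['timestamp'].min(), df2['timestamp'].max(), freq='D')
df2 = df2.set_index('timestamp').reindex(full_range)

print(len(df2))  # 8 rows, 2017-06-10 through 2017-06-17
```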

This is what I have done before

weektime = pd.date_range(start='06/04/2017', end='12/05/2017', freq='W-SUN')
df['week'] = np.nan
df['weektemp'] = np.nan
df['weekhumidity'] = np.nan
df['weeklight'] = np.nan
df['weekpressure'] = np.nan
df['weekspeed'] = np.nan
df['weekbattery_voltage'] = np.nan

for i in range(len(weektime)):
    df.loc[i+1, 'week'] = weektime[i]
    df.loc[i+1, 'weektemp'] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df.loc[i+1, 'weekhumidity'] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df.loc[i+1, 'weeklight'] = df['light'].iloc[7*i+1:7*i+7].sum()
    df.loc[i+1, 'weekpressure'] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df.loc[i+1, 'weekspeed'] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df.loc[i+1, 'weekbattery_voltage'] = df['battery_voltage'].iloc[7*i+1:7*i+7].sum()

The sums are not correct. Because the rows from 2017-06-12 to 2017-06-16 are missing, the value for 2017-06-17 ends up summed into the week of 2017-06-12 to 2017-06-16, and I do not want it counted there. This is not the only gap in the period; I want to fill all of them.

Yating Wu
    There are quite a few links about filling missing datetimes. [Here is one](https://stackoverflow.com/questions/38361526/fill-the-missing-date-values-in-a-pandas-dataframe-column) but there are many more – Sotos Dec 07 '17 at 09:21
  • You didn't explain what the underlying assumption is for filling the missing data. All parameters except time are non-linear and can change dramatically over a span of five missing data points. – Mr. T Dec 07 '17 at 09:44
  • I need to sum 7 days of data, from each 1st record to the 7th (weekly), e.g. the first data set is `df.iloc[0:6].sum()` and the second is `df.iloc[7:13].sum()`, and so on. If the datetimes have a gap, the sum will be wrong, and the others are incorrect as well. – Yating Wu Dec 07 '17 at 09:53
  • Your data point will be wrong anyhow, because it will consist 6/7ths of pure guessing. It is better to make clear that a data point is missing than to pretend an invented data point reflects reality. The exception would be if you can justify how to predict the missing values. But look at the `pressure` data points, for instance. Will they rise in a linear fashion during the missing week? Will they rise abruptly at the beginning or the end of it? You don't know. – Mr. T Dec 07 '17 at 10:18
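
The weekly totals described in the comments can also be computed without positional slicing: make the gaps explicit with `asfreq`, then let `resample` sum per calendar week, so NaN days are simply skipped rather than counted in the wrong week. A sketch with made-up numbers, assuming Sunday-start weeks (hence the `W-SAT` labels):

```python
import pandas as pd

# hypothetical daily readings with several days missing in between
idx = pd.to_datetime(['2017-06-04', '2017-06-05', '2017-06-06',
                      '2017-06-11', '2017-06-17', '2017-06-18'])
df = pd.DataFrame({'temperature': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# make the gaps explicit NaN rows, then sum per Sunday-to-Saturday week;
# sum() skips NaN, so no value lands in the wrong week
daily = df.asfreq('D')
weekly = daily.resample('W-SAT').sum()
print(weekly['temperature'].tolist())  # [6.0, 9.0, 6.0]
```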

2 Answers


Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.

import numpy as np
import pandas as pd

data_arr = np.array(your_df)
periodicity = 'daily'

def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr) #avoid altering the thing you're iterating over
    data_no_gaps_idx = 0

    for row_idx in np.arange(1, rows): #iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0]) 
        oldtimestamp = np.datetime64(oldtimestamp_str)  

        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)

        period = currenttimestamp - oldtimestamp

        period_map = {'quarterly': 900, 'hourly': 3600, 'daily': 86400}  #period length in seconds
        desired_period = period_map[periodicity]

        if period != np.timedelta64(desired_period, 's'):
            periods_missing = int(period / np.timedelta64(desired_period, 's'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period,'s')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing), 
                                 np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0) # INSERT VALUES YOU WANT IN THE NEW ROW

            data_no_gaps_idx += (periods_missing-1) #increment the index (zero-based => -1) in accordance with added rows

        data_no_gaps_idx += 1 #allow index to change as we iterate over original data array (main for loop)

    #create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None,columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])

    return data_arr_no_gaps  
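
After `fill_gaps` inserts the all-NaN rows, the linear interpolation the answer mentions is a one-liner. A self-contained sketch of that second step (column names hypothetical):

```python
import numpy as np
import pandas as pd

# after the gap-filling step, inserted rows hold NaN in every value column;
# pandas' linear interpolation fills them in from their neighbours
df = pd.DataFrame({'Time': pd.date_range('2017-06-10', periods=4, freq='D'),
                   'temp': [20.0, np.nan, np.nan, 23.0]})
df['temp'] = df['temp'].interpolate(method='linear')
print(df['temp'].tolist())  # [20.0, 21.0, 22.0, 23.0]
```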
Renel Chesak

Fill time gaps and nulls

Use the function below to ensure the expected date sequence exists, then use forward fill to fill in the nulls.

import pandas as pd
import os

def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
        A) check for extra dates (out of expected frequency/sequence)
        B) check for missing dates (based on expected frequency/sequence)
        C) use forwardfill to fill nulls
        D) use backwardfill to fill remaining nulls
        E) append to file    
    '''
    
    #rename the timestamp to 'date'
    df = df.rename(columns={"timestamp": "date"})
    
    #sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    
    #create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)

    #record column names
    df_cols = df.columns

    #delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass

    #check for extra dates and/or dates out of order. print warning statement for log
    extra_dates = set(df.date).difference(all_dates)
    
    #if there are extra dates (outside of expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        #INSERT DESIRED BEHAVIOR HERE        
        print('WARNING: Extra date(s):\n\t{}\n\t    Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            #shift extra dates back one day
            df.loc[df.date == date, 'date'] = date - pd.Timedelta(days=1)
        #############################

    #check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))

    #if there are time gaps, deal with them
    if len(gaps) > 0:
        #initialize df of correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date')) #len(index) sets number of rows
        #give index a name
        gaps_df.index.name = 'date'
        #add any constant columns the new rows should carry, e.g.:
        #gaps_df['region'] = some_region
        #gaps_df['type'] = some_type
        #remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)
        #append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        #sort on date
        new_df.sort_values(by='date', inplace=True)
    else:
        new_df = df

    #fill nulls
    new_df.fillna(method='ffill', inplace=True)
    new_df.fillna(method='bfill', inplace=True)

    #append to file
    new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False)

    return new_df
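
Steps C and D of the docstring (forward-fill, then backward-fill whatever leading nulls remain) can be seen in isolation on a toy series:

```python
import numpy as np
import pandas as pd

# step C: forward-fill propagates the last known value into the gap;
# step D: backward-fill handles leading nulls a forward fill cannot reach
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])
s = s.ffill().bfill()
print(s.tolist())  # [1.0, 1.0, 1.0, 1.0, 4.0]
```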
Renel Chesak