I have a dataset that holds hourly weather data for a full year, but only for the 1st through the 20th of each month. The last 10 or so days of every month (with all of their hours) have been removed.

The weather variables are: temperature, humidity, wind_speed, visibility, dew_temperature, solar_radiation, rainfall, snowfall.

I want to upsample the dataset as a time series to fill in the missing days, but I am running into many issues because the climate changes over the year.

Here is what I have tried so far:

import pandas as pd

# seoul_not_func and seoul_p_holidays_17_18 are defined elsewhere (not shown);
# they are used here as collections of 'YYYY-MM-DD' date strings.

def get_hour_month_mean(data, date, hour, max_id):
    # Build one imputed row from the mean (or mode) of the reference rows in `data`.
    day = str(date.date())
    is_functional = day not in seoul_not_func
    return {
        'ID': max_id,
        'temperature': data['temperature'].mean(),
        'humidity': data['humidity'].mean(),
        'date': date,
        'hour': hour,
        'wind_speed': data['wind_speed'].mean(),
        'visibility': data['visibility'].mean(),
        'dew_temperature': data['dew_temperature'].mean(),
        'solar_radiation': data['solar_radiation'].mean(),
        'rainfall': data['rainfall'].mean(),
        # No count on non-functioning days.
        'count': data['count'].mean() if is_functional else 0,
        'snowfall': data['snowfall'].mean(),
        'season': data['season'].mode()[0],
        'is_holiday': 'Holiday' if day in seoul_p_holidays_17_18 else 'No Holiday',
        'functional_day': 'Yes' if is_functional else 'No',
    }

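def upsample_data_with_missing_dates(data):
    # Every calendar day the finished dataset should cover.
    date_range = pd.date_range(start="2017-12-20", end="2018-11-30", freq='D')
    # Days absent from the data (was df['date'], which is not defined here).
    # Requires data['date'] to hold datetime values.
    missing_range = date_range.difference(data['date'])
    hour_range = range(0, 24)
    max_id = data['ID'].max()
    data_copy = data.copy()
    new_rows = []
    for date in missing_range:
        # Reference rows come from the first two days of the following month.
        year = date.year  # was data_copy.year, which is not a scalar year
        month = date.month + 1
        if date.month == 12:
            # December rolls over into January of the next year.
            year += 1
            month = 1
        elif date.month == 11:
            # December 2018 is not in the data, so fall back to December 2017.
            year -= 1
        for hour in hour_range:
            max_id += 1
            month_mask = ((data_copy['year'] == year) &
                          (data_copy['month'] == month) &
                          (data_copy['hour'] == hour) &
                          (data_copy['day'].isin([1, 2])))
            data_filter = data_copy[month_mask]
            new_rows.append(get_hour_month_mean(data_filter, date, hour, max_id))
    # DataFrame.append was removed in pandas 2.0; collect the rows and concatenate once.
    return pd.concat([data, pd.DataFrame(new_rows)], ignore_index=True)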

Any ideas on the best way to estimate the values for the missing days, given that I have the previous 20 days and the next 20 days?

Mostafa Mohamed

1 Answer

There are in fact many ways to deal with missing values in a time series.

You have already tried the traditional approach, imputing the data with mean values. The drawback of this method is the bias it introduces when so many values have to be imputed.
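To make that drawback concrete, here is a minimal sketch on synthetic data (the sine-wave "temperature" is made up for illustration and is not from the question's dataset). It fills a 10-day gap with the overall mean, which is simpler than the per-hour means used in the question, and compares the spread of the series before and after.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperature: 30 days with a daily cycle (illustrative data only).
t = np.arange(30 * 24)
temp = pd.Series(10 + 5 * np.sin(2 * np.pi * t / 24))

# Drop the last 10 days, as in the question, then fill them with the overall mean.
observed = temp.copy()
observed.iloc[20 * 24:] = np.nan
mean_filled = observed.fillna(observed.mean())

# The filled stretch is a flat line, so the spread of the whole series shrinks.
std_true = temp.std()
std_filled = mean_filled.std()
gap_std = mean_filled.iloc[20 * 24:].std()
print(f"true std={std_true:.2f}, mean-filled std={std_filled:.2f}, gap std={gap_std:.2f}")
```

Per-hour means keep the daily cycle, but they still replace every missing day with the same average day, so day-to-day variation is lost in the same way.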

You can try a genetic algorithm (GA), support vector regression (SVR), or autoregressive (AR) and moving-average (MA) models for time series imputation and modeling. These methods are used to forecast and/or impute time series, and they help overcome the bias caused by the traditional mean method.

(Keep in mind that you have a multivariate time series.)

Here are some resources you can use:
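As a rough illustration of the autoregressive idea, the sketch below fits a plain AR(p) model by least squares with NumPy and rolls it forward into a gap. The helper names `fit_ar` and `forecast_ar` are made up for this example, and it treats a single variable on its own, so it ignores the multivariate structure of the data. A dedicated time series library would be the more usual choice in practice.

```python
import numpy as np


def fit_ar(series, order):
    """Least-squares fit of x[t] = c + a1*x[t-1] + ... + ap*x[t-p] (order >= 1)."""
    x = np.asarray(series, dtype=float)
    # Column k holds the value k steps before each target, most recent lag first.
    lags = np.column_stack([x[order - k:len(x) - k] for k in range(1, order + 1)])
    design = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(design, x[order:], rcond=None)
    return coef  # [c, a1, ..., ap]


def forecast_ar(history, coef, steps):
    """Roll the fitted model forward `steps` values past the end of `history`."""
    order = len(coef) - 1
    window = list(np.asarray(history, dtype=float)[-order:])
    out = []
    for _ in range(steps):
        recent_first = window[::-1]
        nxt = coef[0] + float(np.dot(coef[1:], recent_first))
        out.append(nxt)
        # Slide the window: drop the oldest value, append the new forecast.
        window = window[1:] + [nxt]
    return np.array(out)


# Usage on synthetic data: 20 observed days of hourly values, 10 days to fill.
t = np.arange(20 * 24)
observed = 10 + 5 * np.sin(2 * np.pi * t / 24)
coef = fit_ar(observed, order=2)
gap_fill = forecast_ar(observed, coef, steps=10 * 24)
```

Since the question has data on both sides of every gap, one option is to also forecast backward from the following month and blend the two forecasts. That step is not shown here.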

A Survey on Deep Learning Approaches

time.series.missing-values-in-time-series-in-python

Interpolation in Python to fill Missing Values
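For the interpolation route, here is a minimal pandas sketch, assuming the frame has a datetime `date` column, an integer `hour` column (0-23), and one row per date and hour. The function name `interpolate_by_hour` is hypothetical. Instead of drawing one straight line across a 10-day hourly gap, it interpolates each hour of the day separately across dates, so the daily cycle is kept.

```python
import pandas as pd


def interpolate_by_hour(df, value_cols, start, end):
    """Fill missing days by interpolating each hour of the day across dates."""
    full_dates = pd.date_range(start=start, end=end, freq="D")
    filled = {}
    for col in value_cols:
        # Rows are dates, columns are hours; missing days become all-NaN rows.
        wide = df.pivot(index="date", columns="hour", values=col).reindex(full_dates)
        # Interpolate down each hour column, weighting by elapsed time.
        # limit_direction="both" also fills gaps at the very start or end
        # by repeating the nearest observed value.
        filled[col] = wide.interpolate(method="time", limit_direction="both").stack()
    out = pd.DataFrame(filled)
    out.index.names = ["date", "hour"]
    return out.reset_index()
```

Called as `interpolate_by_hour(df, ["temperature", "humidity"], "2017-12-01", "2018-11-30")`, this returns one row per date and hour. It is probably a poor fit for sparse variables such as rainfall and snowfall, where a smooth ramp between two dry days or two wet days does not describe what happened in between.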