1

I have a data frame that looks like the following:

Date Time Entry Exist 2013-01-07 05:00:00 29.0 12.0 2013-01-07 10:00:00 98.0 83.0 2013-01-07 15:00:00 404.0 131.0 2013-01-07 20:00:00 2340.0 229.0 2013-01-08 05:00:00 3443.0 629.0 2013-01-08 10:00:00 6713.0 1629.0 2013-01-08 15:00:00 9547.0 2965.0 2013-01-08 20:00:00 10440.0 4589.0

I would like to convert it and normalize so that it shows the hourly consumption over time.

DateTime Entry Exist 2013-01-07 00:00:00 2.0 1.0 2013-01-07 01:00:00 9.0 4.0 2013-01-07 02:00:00 16.0 6.0 2013-01-07 03:00:00 23.0 9.0 2013-01-07 04:00:00 26.0 10.0 2013-01-07 05:00:00 29.0 12.0 2013-01-07 06:00:00 37.0 19.0 2013-01-07 07:00:00 56.0 32.0 2013-01-07 08:00:00 62.0 57.0 2013-01-07 09:00:00 77.0 63.0 2013-01-07 10:00:00 98.0 83.0 2013-01-07 11:00:00 104.0 95.0 .......

I would like to first concatenate the date and time into one single column as DateTime and then achieve the above result.

New to python, any help would be much appreciated. Thank you.

Barry Lucky
  • 49
  • 1
  • 9
  • From your example, there is no information to fill the missing hours. e.g. how do you know the entry and exist at 2013-01-07 00:00:00? – Jingshao Chen Jul 18 '18 at 00:57
  • I don't understand what changed about your `Entry` and `Exist` columns – user3483203 Jul 18 '18 at 00:58
  • I think I was unclear about my initial request, the example I gave for the output contains random continuous numbers. If you see the 05 and 10 hours have the same values for Entry and Exist for input and output. Whereas when we split the DateTime into more intervals, those values would be NaN, so I would need to interpolate that using any of the methods. The Entry values for 1 to 4 should be in the range of 0 and 29 and the Exist in the range of 0 and 12. To predict these values of Entry and Exist for 1 to 4, 6 to 9, 11 to 14, etc I would need the help of Interpolation. – Barry Lucky Jul 18 '18 at 01:07

1 Answers1

2

The quick answer is that you can use

DataFrame.resample().mean().interpolate() 

To at least do the interpolation part of your post.

Note that your post includes "out of domain" extrapolation, as you are predicting outside the domain of the input data. i.e the time series starts at 5:00 AM on 1/7, however your oversampled data starts 5 hours earlier. Interpolation is only a within domain method, but I suspect that is what you wanted.

Here's the steps for interpolating.

First, it helps if you can post a self contained example with code that either generates data for testing, or has some way to reproduce it.

Reference these two excellent posts:

Combine Date and Time columns using python pandas

How to create a Pandas DataFrame from a string

Here's how I did it:

import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show

# copied and pasted from your post :)
data = StringIO("""
Date             Time         Entry       Exist
2013-01-07      05:00:00        29.0       12.0
2013-01-07      10:00:00        98.0       83.0
2013-01-07      15:00:00       404.0      131.0
2013-01-07      20:00:00      2340.0      229.0
2013-01-08      05:00:00      3443.0      629.0
2013-01-08      10:00:00      6713.0      1629.0
2013-01-08      15:00:00      9547.0      2965.0
2013-01-08      20:00:00     10440.0      4589.0""")

# read in the data,  converting the separate date and times to a single date time.
# see the link to do this "after the fact" if your data has separate date and time columns

df = pd.read_csv(data, 
    parse_dates={"date_time": ['Date', 'Time']}, 
    delim_whitespace=True)

Now, make the data a time series, resample it, apply a function (mean in this case) and interpolate both data columns at the same time.

df_rs = df.set_index('date_time').resample('H').mean().interpolate('linear')
df_rs

Which looks like this:

enter image description here

Those values don't look exactly like the ones in your post, but it was not clear what sort of interpolation was being used. Linear, Cubic? Other?

So for fun, let's plot the data with bokeh. Large red dots are the orig data, while the blue dots (and connecting lines) are the interpolated data.

output_notebook()

p = figure(x_axis_type="datetime", width=800, height=500)

p.title.text = "Entry vs. Date Time (cubic interpolated to 1H)"
p.xaxis.axis_label = 'Date Time (cubic interpolated to 1H)'
p.yaxis.axis_label = 'Entry'

# orig data
p.circle(df['date_time'], df['Entry'], color='red', size=10)

# oversampled data
p.circle(df_rs.index, df_rs['Entry'])
p.line(df_rs.index, df_rs['Entry'])

show(p)

Which looks like this:

enter image description here

Or with cubic interpolation, you get a bit more smoothing:

enter image description here

FULL CODE

import pandas as pd
from io import StringIO
from bokeh.plotting import figure, output_notebook, show

output_notebook()

# copied and pasted from your post :)
data = StringIO("""
Date            Time        ENTRIES       EXITS
2013-01-07      05:00:00        29.0       12.0
2013-01-07      10:00:00        98.0       83.0
2013-01-07      15:00:00       404.0      131.0
2013-01-07      20:00:00      2340.0      229.0
2013-01-08      05:00:00      3443.0      629.0
2013-01-08      10:00:00      6713.0      1629.0
2013-01-08      15:00:00      9547.0      2965.0
2013-01-08      20:00:00     10440.0      4589.0""")

# read in the data,  converting the separate date and times to a single date time.
# see the link to do this "after the fact" if your data as separate date and time columns
original_data = pd.read_csv(data, 
    parse_dates={"DATETIME": ['Date', 'Time']}, 
    delim_whitespace=True)

# make it a time series, resample to a higher freq, apply mean, interpolate and round
inter_data = original_data.set_index(['DATETIME']).resample('H').mean().interpolate('linear').round(1) 

# No need to drop the index to select a slice.  You can slice on the index
# I see you are starting at 1/1 (jan 1st),  yet your data starts at 1/7 (Jan 7th?)
inter_data[inter_data.index >= '2013-01-01 00:00:00'].head(20) 

enter image description here

Randall Goodwin
  • 1,916
  • 2
  • 18
  • 34
  • The Code I used is: `inter_data = original_data.set_index(['DATETIME']).resample('H').mean().interpolate('linear') inter_data.reset_index(inplace=True) inter_data.ENTRIES=inter_data.ENTRIES.round(1) inter_data.EXITS=inter_data.EXITS.round(1) inter_data[inter_data.DATETIME >= '2013-01-01 00:00:00'].head(20)` – Barry Lucky Jul 18 '18 at 15:04
  • I tried to implement the same to my data frame, but the Entry and Exist values were not continuous. Here is what I got: `index DATETIME ENTRIES EXITS 72 72 2013-01-01 00:00:00 4234935.6 2175034.2 73 73 2013-01-01 01:00:00 3249043.7 2697696.5 74 74 2013-01-01 02:00:00 1404653.5 828730.7 75 75 2013-01-01 03:00:00 3959076.0 2685191.8 76 76 2013-01-01 04:00:00 4397057.8 2409161.3 77 77 2013-01-01 05:00:00 2683292.3 2182695.4 78 78 2013-01-01 06:00:00 1168712.3 687371.6 79 79 2013-01-01 07:00:00 3969078.2 2700993.6 ` – Barry Lucky Jul 18 '18 at 15:07
  • The values for Entry and Exist are not continuous. – Barry Lucky Jul 18 '18 at 15:20
  • I'm not sure what you mean by not continuous. The resulting time series is a regular time series of 1 hour increments. The values at the non-sampled hours are linear interpolations between the sampled times. Note too, you can use .round(1) right after the .interpolate() instead of calling it twice on each column. Please up-vote my answer if it was helpful :). – Randall Goodwin Jul 18 '18 at 19:54
  • The time series is good. I am talking about the values at the non-sampled hours. They are not in the expected range. Example: If 05:00:00 the value is 100 and in 10:00:00 it is 500. The values for 02,03,04 should lie between 100 and 500, and value at 02 < value at 03 and value at 03 < value at 04. But when I apply the same linear interpolation, the behavior changes, and it goes above 500. Is there a way to force the interpolated values to be in the given range, that follows the above rule? Also, thanks for the update on the round(1). I have made the correction in my code. – Barry Lucky Jul 18 '18 at 21:51
  • I am at a loss. Only thing I see is that you are looking at the data from "2013-01-01 00:00:00" (January 1st), but the data in the post starts January 7th. I tacked on full code at the end. Copy and past it and try that. Upvote my answer if you think it has been helpful :) – Randall Goodwin Jul 19 '18 at 02:09
  • Hello there @RandallGoodwin is there any way to plot the above data using matplotlib instead of bokeh – Devarshi Goswami Feb 11 '20 at 07:01