I am attempting to reference this SO post where the first answer someone creates some code to analyze the time series data of rain fall and totalize in hours for each rain event. I want to do something similar except for totalizing how often a pump runs (event) in hours each day from time series data. There is no need on my end for totalizing anything else just the duration of the event.
#read CSV file
df = pd.read_csv('C:\\Users\\desktop\\data.csv', index_col='Date', parse_dates=True)
# Converting the index as date
df.index = pd.to_datetime(df.index)
df
My data looks like this:
ChWaterPrs
Date
1/0/00 12:45 AM 0.0
1/0/00 12:50 AM 0.0
1/0/00 12:55 AM 0.0
1/0/00 12:00 AM 0.0
1/0/00 1:05 AM 0.0
I know that the pump is running everytime the df['ChWaterPrs']
is greater than 5, so I am attempting to build these helper colums as mentioned in the SO post:
# create helper columns, a block would be anytime value is > 5
df['block'] = df['ChWaterPrs'] >= 5
df['day'] = df.index.normalize()
# group by day to get unique block count and value count
session_map = df[df['ChWaterPrs'].astype(bool)].groupby('day')['block'].nunique()
hour_map = df[df['ChWaterPrs'].astype(bool)].groupby('day')['ChWaterPrs'].count()
# map to original dataframe
df['sessions'] = df['day'].map(session_map)
df['hours'] = df['day'].map(hour_map)
# calculate result
res = df.groupby(['day', 'hours', 'sessions'], as_index=False)['ChWaterPrs'].sum()
res['duration'] = res['hours'] / res['sessions']
res['amount'] = res['ChWaterPrs'] / res['sessions']
But if I print res
something is screwed up... The durations don't make sense that they are greater than 24 hours in a day. Basically the only information I would love to have is a resample
of the time series data for how many hours
that the pump ran which is determined by the df['ChWaterPrs']
being greater than 5. I think the data would show everyday (a session) where the df['ChWaterPrs']
would be zero else greater than 5 when running. Any tips help thank you.