I have the data set below, which is a reading of values every 5 seconds. I need to do two operations on it.

  1. Calculate the average value for every minute from the data set
  2. Using these minute averages, calculate the hourly variation (i.e. the difference between successive minute values, summed over each hour)

What would be the best way to achieve this?

2018-02-10 17:25:49.074206,340
2018-02-10 17:25:54.078155,340
2018-02-10 17:25:59.081041,340
2018-02-10 17:26:04.085504,340
2018-02-10 17:26:09.089500,340
2018-02-10 17:26:14.092926,340
2018-02-10 17:26:19.097002,340
2018-02-10 17:26:24.101067,340
2018-02-10 17:26:29.104451,340
2018-02-10 17:26:34.108283,340
2018-02-10 17:26:39.112641,340
2018-02-10 17:26:44.115325,340
2018-02-10 17:26:49.120067,340
2018-02-10 17:26:54.124166,340
2018-02-10 17:26:59.127224,340

I have looked at various Stack Overflow posts and have the below, less than optimal, code; norm_by_data1 still has errors.

import pandas as pd
from pandas import read_csv
from pandas import datetime
from matplotlib import pyplot

def parser(x):
        return datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d %H:%M')

def parser1(x):
        return datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d %H')

def norm_by_data(x):
        return x.mean()

prevrow = None
total = None

def norm_by_data1(x):
        for row in x:
           total += row - prevrow
           prevrow = row

series = read_csv('water_data.txt', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
#print(series.head())
series.groupby(level=0).apply(norm_by_data).to_csv("tmp")

series1 = read_csv('tmp', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser1)
series1.groupby(level=0).apply(norm_by_data1)
Hemanth V
  • Sidenote: statistically, calculating the variance of a (moving) average results in a significantly lower variance. Depending on what you want to express, it might be (from a statistical viewpoint) better to calculate the variance based on the 5-second data. – agtoever Feb 24 '18 at 13:22
  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Feb 24 '18 at 13:41

1 Answer

0) To get a datetime index from a .csv file, you can do something like this:

df = pd.read_csv('water_data.txt', parse_dates=[0], index_col=0)

parse_dates=[0] will parse dates for the column in position 0, and index_col=0 will make the column in position 0 the DataFrame index.
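
If the DataFrame is created dynamically in memory instead of being read from a CSV (as asked in the comments below), you can build the same kind of index by converting the timestamp column with pd.to_datetime and setting it as the index. A small sketch, where 'timestamp' and 'values' are just example column names, not from your data:

import pandas as pd

# Sketch: build a DatetimeIndex for an in-memory DataFrame
# ('timestamp' and 'values' are example names used only for this illustration)
records = [('2018-02-10 17:25:49.074206', 340),
           ('2018-02-10 17:25:54.078155', 340)]
df = pd.DataFrame(records, columns=['timestamp', 'values'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')  # resample() works on this index as well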

1) You need to set a datetime index and then run the code below (if you don't have a datetime index, let me know and I'll show you how to do it):

df.resample('1Min').mean()

2) You will also need a datetime index to do this.

# Gets mean for every minute
ndf = df.resample('1Min').mean()

# Calculate the difference between each minute's mean and the previous minute's mean
ndf['diff'] = ndf['values'].diff(periods=1) # You might need to chain here a .abs() as well

# Sum the minute-to-minute differences for each hour
ndf['diff'].resample('1H').sum()
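
Putting steps 1) and 2) together on the sample data, a minimal end-to-end sketch (assuming water_data.txt has no header row, as in the sample above, so the column names are supplied explicitly and are just placeholders) would be:

import pandas as pd

# Read the raw 5-second readings; the column names are assumptions for this sketch
df = pd.read_csv('water_data.txt', header=None, names=['timestamp', 'values'],
                 parse_dates=[0], index_col=0)

minute_means = df.resample('1Min').mean()             # average per minute
minute_diff = minute_means['values'].diff()           # change from the previous minute
hourly_variation = minute_diff.resample('1H').sum()   # total variation per hour
print(hourly_variation)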

3) This should work for aggregating negative and positive numbers with different functions:

# This will throw an error until 'func' is replaced with your custom function
ndf['diff'].resample('1H').agg({'neg': lambda x: func(x[x < 0]), 'pos': lambda x: x[x > 0].sum()})
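
Note that newer pandas versions may reject a dict passed to a Series aggregation like this (dict-based renaming was deprecated and later removed). A hedged alternative that avoids the dict, still treating 'func' as a placeholder for your own custom function, is to run two separate resamples and combine them:

# 'func' is still a placeholder for your custom function, not a real pandas method
neg = ndf['diff'].resample('1H').apply(lambda x: func(x[x < 0]))
pos = ndf['diff'].resample('1H').apply(lambda x: x[x > 0].sum())
result = pd.concat({'neg': neg, 'pos': pos}, axis=1)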
joaoavf
  • >> if you don't have a datetime index, let me know and I'll show you how to do it. Yes please, new to python – Hemanth V Feb 24 '18 at 15:45
  • @HemanthV, I've edited my post and added code that will do that, can you check it out and see if it solves your problem? – joaoavf Feb 24 '18 at 16:04
  • Yes, works great, thanks. The only additional thing I need is to do the sum only if the difference is positive and, if negative, run a different function. Any ideas? – Hemanth V Feb 24 '18 at 16:18
  • A custom function, for now just say "func" – Hemanth V Feb 24 '18 at 16:24
  • In case I need to read the data dynamically and create a DataFrame rather than reading a CSV file, would the resample function still work? – Hemanth V Feb 28 '18 at 17:46