First of all, apologies if this has been solved somewhere. I have gone through these posts a, b and they do not seem to address my problem.
For several objects, I want to calculate parameter differences in time, and sum them up.
synthetic data:
import pandas as pd
import numpy as np
np.random.seed(23)
dict_1 = {
'time': range(100),
'object': [1]*100,
'parameter_1':np.random.randint(-10,10,100),
'parameter_2':np.random.randint(-10,10,100)}
dict_2 = {
'time': range(100),
'object': [2]*100,
'parameter_1':np.random.randint(-120,-110,100),
'parameter_2':np.random.randint(-10,10,100)}
df1 = pd.DataFrame(dict_1)
df2 = pd.DataFrame(dict_2)
df = df1.append(df2)
approach:
using groupby
, agg
, and diff
:
for df1:
diffs = df1.groupby(['object', 'time']).agg({'parameter_1': np.mean, 'parameter_2': np.mean}).diff()
diffs['total_variation'] = abs(diffs['parameter_1']) + abs(diffs['parameter_2'])
np.max(diffs['total_variation'])
output: 35
for df2:
diffs = df2.groupby(['object', 'time']).agg({'parameter_1': np.mean, 'parameter_2': np.mean}).diff()
diffs['total_variation'] = abs(diffs['parameter_1']) + abs(diffs['parameter_2'])
np.max(diffs['total_variation'])
output: 24
for the combined data:
diffs = df.groupby(['object', 'time']).agg({'parameter_1': np.mean, 'parameter_2': np.mean}).diff()
diffs['total_variation'] = abs(diffs['parameter_1']) + abs(diffs['parameter_2'])
np.max(diffs['total_variation'])
expected result:
35
actual result:
129
So, it seems that diff()
ignores the grouping.
I could write a solution for this that involves for-loops, but since my data is on the large side, I would rather avoid that. Is there a way to fix this, or am I misusing the functions?
Help is greatly appreciated!