Named aggregations with multiple columns

Question

I'm trying to apply the following two-input function in an aggregation statement, but I get a TypeError (unhashable type: 'list'):

from datetime import datetime

import pandas as pd

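def process_difftime(x, y, start, final):
  # x: instance labels, y: timestamps (same index as x)
  t1 = []
  t2 = []
  for i in x.index:
    if x[i] == start:
      t1.append(y[i])
    elif x[i] == final:
      t2.append(y[i])
  # hours between the latest start and the latest final timestamp
  res = round((max(t2) - max(t1)).total_seconds()/3600, 2)
  return res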

List0 = pd.Series(['A10000','A10000','A10001','A10001'], index=[2,3,4,5])
List1 = pd.Series(['A_Create','A_Accepted','A_Create','A_Accepted'], index=[2,3,4,5])
List2 = pd.Series(['2016-08-03 15:57:21','2016-08-03 16:57:21','2016-08-03 15:57:21','2016-08-03 19:57:21'], index=[2,3,4,5])
List2 = pd.Series([datetime.strptime(x,'%Y-%m-%d %H:%M:%S') for x in List2], index=[2,3,4,5])

df = pd.DataFrame({
    'code':List0,
    'instance':List1,
    'timestamp':List2
})

df.groupby(['code']) \
  .agg(
      a_concept_difftime = (['instance','timestamp'], lambda x,y: process_difftime(x,y,'A_Create','A_Accepted'))
  )

Any suggestions?

Desired output

code    a_concept_difftime
A10000  1.0
A10001  4.0

Additional details: I'm working with a large event-log dataset that records the execution of a semi-standardized process. There are about 60 different instances (stages of the process) and 3 different timestamps (schedule, start, complete). The goal of the function is to take an instance column and a timestamp type and calculate the difference in hours between two instances (the combination could change).

Also, expected output for the sample would help in trying to understand the goal here — cs95, Feb 14 '21 at 22:41
@DavidM perhaps expand your example to show us why this answer would not work, and what the expected output would be in that case. Thanks. — cs95, Feb 14 '21 at 22:56
@cs95 I added some details, wish you could help me run the function — David M, Feb 14 '21 at 23:16

David M · Answer 1 · 2021-02-15T01:30:32.107

After a couple of hours looking for a solution, I found this contribution, which solves my problem:

df.groupby('code') \
  .apply(lambda x: pd.Series({
      'a_accepted_time':process_difftime(x['instance'], x['timestamp'], 'A_Create', 'A_Accepted')
  }))

I also found that tuple-based named aggregation does not work with multiple columns: each tuple takes a single column name and a function, as mentioned in this issue: https://github.com/pandas-dev/pandas/issues/29268.
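
For reference, here is a self-contained sketch of the groupby/apply workaround run on the sample data from the question. The helper name hours_between is hypothetical; it is a boolean-mask variant of process_difftime written for this example, not something provided by pandas. The column selection before apply is my own choice, so that the function only sees the columns it needs.

```python
import pandas as pd


def hours_between(instances, timestamps, start, final):
    """Hours from the latest `start` timestamp to the latest `final` one.

    Hypothetical helper: same idea as process_difftime, but it uses
    boolean masks instead of an explicit loop over the index.
    """
    t_start = timestamps[instances == start].max()
    t_final = timestamps[instances == final].max()
    return round((t_final - t_start).total_seconds() / 3600, 2)


# Sample data from the question
df = pd.DataFrame({
    "code": ["A10000", "A10000", "A10001", "A10001"],
    "instance": ["A_Create", "A_Accepted", "A_Create", "A_Accepted"],
    "timestamp": pd.to_datetime([
        "2016-08-03 15:57:21", "2016-08-03 16:57:21",
        "2016-08-03 15:57:21", "2016-08-03 19:57:21",
    ]),
})

# A named-aggregation tuple accepts only one column, so apply() is used:
# the lambda receives each group as a DataFrame and can read both columns.
result = (
    df.groupby("code")[["instance", "timestamp"]]
      .apply(lambda g: pd.Series({
          "a_concept_difftime": hours_between(
              g["instance"], g["timestamp"], "A_Create", "A_Accepted"
          ),
      }))
)

print(result)  # one row per code: A10000 -> 1.0, A10001 -> 4.0
```

Because the start and final labels are plain arguments, the same call should work for other instance pairs or timestamp columns by changing what is passed in.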

Thanks to @r2evans for the contribution. https://stackoverflow.com/a/53096340/12514619
