
I am trying to add a column to my pandas data frame where each entry is the difference between another column's values in two adjacent rows, provided certain conditions are met. Following this answer to "get previous row's value and calculate new column pandas python", I'm using shift to find the delta between the duration_seconds entries in the two rows (next minus current). I then want to return that delta as the derived entry if both rows have the same user_id, the next row's action is not 'login', and the delta is not negative. Here's the code:

def duration (row):
    candidate_duration = row['duration_seconds'].shift(-1) - row['duration_seconds']
    if row['user_id'] == row['user_id'].shift(-1) and row['action'].shift(-1) != 'login' and candidate_duration >= 0:
        return candidate_duration
    else:
        return np.nan

Then I test the function using

analytic_events.apply(lambda row: duration(row), axis = 1)

But that throws an error:

AttributeError: ("'int' object has no attribute 'shift'", 'occurred at index 9464384')

I wondered if this was akin to the error fixed here and so I tried passing in the whole data frame thus:

duration(analytic_events)

but that throws the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What should I be doing to achieve this combination; how should I be using shift?

dumbledad
  • 1
    Is it possible add an example dataframe and expected output? – Erfan Mar 20 '19 at 16:19
  • I was hoping it would be self explanatory without that @Erfan, it is probably a stupid mistake I am making. But I can try (the real data is enormous and private) – dumbledad Mar 20 '19 at 16:20
  • 1
    samples data always help understanding the statements. :) – anky Mar 20 '19 at 16:56

1 Answer


Without seeing your data, I can only suggest an approach: you could simplify this by creating the column conditionally with np.where:

import numpy as np

# Next row's duration_seconds minus the current row's
delta = analytic_events['duration_seconds'].shift(-1) - analytic_events['duration_seconds']

cond1 = analytic_events['user_id'] == analytic_events['user_id'].shift(-1)  # next row has the same user
cond2 = analytic_events['action'].shift(-1) != 'login'  # next row's action is not a login
cond3 = delta >= 0  # delta is not negative

analytic_events['candidate_duration'] = np.where(cond1 & cond2 & cond3, delta, np.nan)

Explanation: np.where works as follows: np.where(condition, value if true, value if false)
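To make the behaviour concrete, here is a minimal sketch of the np.where approach above on a small made-up frame. The sample values are hypothetical, since the real analytic_events data is not shown in the question; only the column names come from the question. It also shows why the row-wise attempt fails: with apply(..., axis=1) each row['duration_seconds'] is a single scalar, which has no shift method, whereas shift works on a whole column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real analytic_events frame is private.
analytic_events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "action": ["login", "view", "login", "login", "view"],
    "duration_seconds": [0, 30, 45, 10, 5],
})

# shift(-1) aligns each row with the values of the row below it.
next_user = analytic_events["user_id"].shift(-1)
next_action = analytic_events["action"].shift(-1)
delta = analytic_events["duration_seconds"].shift(-1) - analytic_events["duration_seconds"]

# Element-wise conditions, combined with & rather than `and`.
same_user = analytic_events["user_id"] == next_user
not_login = next_action != "login"
non_negative = delta >= 0

analytic_events["candidate_duration"] = np.where(
    same_user & not_login & non_negative, delta, np.nan
)
print(analytic_events)
```

With this sample, only the first row keeps a value (30.0): the second row is followed by a login, the third by a different user, the fourth has a negative delta, and the last row has no next row. Using & instead of `and` is what avoids the "truth value of a Series is ambiguous" error, because & compares the Series element by element.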

Erfan
  • Hmmm. When you say "simplify" you could have said "fix". I'm not sure why yours works and my original doesn't. (N.B. My take away—again—is don't use `apply`!) – dumbledad Mar 20 '19 at 17:41