How to compare an argument to separate columns in each row of a Pandas dataframe?

Question

I have a DataFrame dwells with Event ID, Start Time, and End Time:

In []: dwells[['Event ID','Start Time','Stop Time']].head()
Out[]: 
    Event ID          Start Time           Stop Time
0  367067960 2016-09-01 00:05:00 2016-10-05 14:00:00
1  311288000 2016-09-01 00:05:00 2016-09-01 23:30:00
2  636016999 2016-09-01 00:05:00 2016-09-01 01:50:00
3  247304600 2016-09-01 01:20:00 2016-09-01 21:25:00
4  636016590 2016-09-01 06:55:00 2016-09-01 23:35:00

In []: dwells[['Event ID','Start Time','Stop Time']].dtypes
Out[]: 
Event ID               int64
Start Time    datetime64[ns]
Stop Time     datetime64[ns]
dtype: object

I'm trying to determine the number of events that were occurring at every increment of a DateTimeIndex with 5 minute frequency. Eventually I want to know the number of events and the cumulative time that the events were occurring (by multiplying the number of events and the time step and taking the cumsum):

In []: start = datetime(2016,9,1)
  ...: end = datetime(2016,12,31)
  ...: rng = pd.date_range(start, end, freq='5min')
  ...: rng[:5]

Out[]: 
DatetimeIndex(['2016-09-01 00:00:00', '2016-09-01 00:05:00',
               '2016-09-01 00:10:00', '2016-09-01 00:15:00',
               '2016-09-01 00:20:00'],
              dtype='datetime64[ns]', freq='5T')

I want to loop over the DateTimeIndex and compare each entry to the Start Time and Stop Time to see if it is between them, setting an appropriate variable in a new FLAG field. I can then sum the FLAG field and set it as the value of a series with rng as the index, like:

series = pd.Series(index=rng)
for x in rng:
    dwells['FLAG'] = dwells[['Start Time', 'Stop Time']].apply(lambda i,j: 1 if i.value <= x.value <= j.value else 0)
    series.loc[x] = dwells['FLAG'].sum()

That apply function doesn't work. I haven't been able to come up with a function that lets me check the x value against the time range in every row.

I'd appreciate help defining a function that gives me an output like:

In []: series[:5]
Out[]:
2016-09-01 00:00:00   37
2016-09-01 00:05:00   39
2016-09-01 00:10:00   40
2016-09-01 00:15:00   39
2016-09-01 00:20:00   35

If there's a more efficient approach to solving this problem I'd appreciate that as well.

score 0 · Answer 1 · edited May 23 '17 at 12:17

I found a good starting point at this post: python pandas: apply a function with arguments to a series

Which led me to the documentation on defining a custom function with keyword arguments applied to series, here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply

Since a DF row is a series, I wrote the following:

def flag_events(row, **kwargs):
    '''Applied row-wise to a DF, checks if kwargs['t_step'] is between 'Start Time' and 'Stop Time', returning 1 if yes and 0 if no'''
    if row['Start Time'].value <= kwargs['t_step'] <= row['Stop Time'].value:
         return 1
    else:
        return 0

DwellTable = pd.DataFrame(index=rng)

DwellTable['VesselCount'] = DwellTable.index.map(lambda x: dwells.apply(flag_events, t_step=x.value, axis=1).sum())

DwellTable['DwellMin'] = DwellTable['EventCount']*5
DwellTable['DwellMinCum'] = DwellTable['DwellMin'].cumsum()

That worked, but it takes a long time to run. I'd still appreciate suggestions for a more efficient approach.

How to compare an argument to separate columns in each row of a Pandas dataframe?

1 Answers1