
Say we have a dataframe df with a list of dates in chronological order by day.

The goal is to compute, for each date in df, the number of people in dg whose start-to-end range includes that date.

import pandas as pd
import datetime

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1), 
                                 datetime.date(2007, 12, 2), 
                                 datetime.date(2007, 12, 3)], 
                        'num_people_on_day': [0,0,0]})

dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck'],
                        'start': [datetime.date(2007, 11, 5), 
                                  datetime.date(2007, 12, 8), 
                                  datetime.date(2007, 1, 5)],
                        'end': [datetime.date(2007, 12, 6), 
                                datetime.date(2008, 1, 3), 
                                datetime.date(2007, 11, 30)]})

(The desired output is df with num_people_on_day filled in: here only Alice's range, 2007-11-05 through 2007-12-06, covers the three dates, so each count is 1.)

So for each date in df, how can I efficiently check all of dg, count the matching rows, and write that count into df?

I am not even sure a merge is necessary here (I am also trying to save memory), and I really want this to run as fast as possible.

EDIT: I have come up with a different way to do this, but I hate using apply. Is there a way to do it without .apply?

import pandas as pd
import datetime

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1), 
                                 datetime.date(2007, 12, 2), 
                                 datetime.date(2007, 12, 3)]})

dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck', 'Dave'],
                        'start': [datetime.date(2007, 11, 5), 
                                  datetime.date(2007, 12, 8), 
                                  datetime.date(2007, 1, 5),
                                  datetime.date(2007, 11, 6)],
                        'end': [datetime.date(2007, 12, 1), 
                                datetime.date(2008, 1, 3), 
                                datetime.date(2007, 11, 30),
                                datetime.date(2007, 12, 2)]})

def get_num_persons(date, vec_starts, vec_ends):
    """
    Helper for .apply: counts the persons present on `date`.
    A person is counted when the date falls between their start
    and end dates (inclusive); the elementwise AND keeps only the
    rows where both comparisons are True, and .sum() counts them.
    """
    return ((vec_starts <= date) & (vec_ends >= date)).sum()

def num_of_persons(persons, dates_df):
    """
    Obtains the number of persons present on each day.
    """
    dates_df['num_persons'] = dates_df['date'].apply(
        lambda d: get_num_persons(d, persons['start'], persons['end']))
    return dates_df

num_of_persons(dg, df.copy())
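
One way to drop .apply entirely is to broadcast the comparisons and sum across people. This is a sketch, assuming a len(df) x len(dg) boolean matrix fits in memory, so it trades memory for speed:

# Sketch: fully vectorized counting, no .apply.
# Converting to datetime64 first keeps the comparisons fast.
dates = pd.to_datetime(df['date']).to_numpy()[:, None]    # shape (num_dates, 1)
starts = pd.to_datetime(dg['start']).to_numpy()[None, :]  # shape (1, num_people)
ends = pd.to_datetime(dg['end']).to_numpy()[None, :]

# Entry (i, j) is True when person j's range covers date i;
# summing over axis 1 gives the count per date.
df['num_persons'] = ((starts <= dates) & (dates <= ends)).sum(axis=1)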
bbd108
  • https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range or perhaps https://stackoverflow.com/questions/51755268/pandas-merge-on-datetime-or-datetime-in-datetimeindex or https://stackoverflow.com/questions/44367672/best-way-to-join-merge-by-range-in-pandas. Then just a groupby and size to get the counts. – ALollz Jun 20 '19 at 14:43
  • Interesting but not really. The last 2 links are not useful. The first one maybe, but I'm not classifying here. Like I said I don't even think I need to merge. I'll try and fix the first link to my case – bbd108 Jun 20 '19 at 14:52
  • The last two links are entirely useful: `m = df.assign(k=1).merge(dg.assign(k=1))` followed by `m[m.date.between(m.start, m.end)].groupby('date').size()` solves your problem 100% (maybe add a `.reindex(df.date).fillna(0)`, but that's hardly the focal point of the question) – ALollz Jun 20 '19 at 14:58
  • You are correct in that it works, however for me the merge makes too many duplicates and as such I get a MemoryError (appx 50k rows for 1200 days -> 60m rows). Need to investigate a different approach thanks though. – bbd108 Jun 20 '19 at 15:29

2 Answers


With sufficient memory, merge everything together, then count the dates that fall within each range; the .reindex ensures dates with no matches come back as 0.

# If the columns hold datetime.date objects, convert them to datetime64 first:
#df['date'] = pd.to_datetime(df.date)
#dg['start'] = pd.to_datetime(dg.start)
#dg['end'] = pd.to_datetime(dg.end)

m = df[['date']].assign(k=1).merge(dg.assign(k=1))

(m[m.date.between(m.start, m.end)].groupby('date').size()
   .reindex(df.date).fillna(0)
   .rename('num_people_on_day').reset_index())

         date  num_people_on_day
0  2007-12-01                  1
1  2007-12-02                  1
2  2007-12-03                  1
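
As a side note, on pandas 1.2+ the assign(k=1) dummy-key trick can be written as an explicit cross join; a minimal equivalent:

# pandas >= 1.2: cross join without the dummy key
m = df[['date']].merge(dg, how='cross')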

The other option is to use apply with an IntervalIndex. This is a loop in disguise, so performance suffers as df grows.

def get_len(x, dg):
    """Count the rows of dg whose interval contains the date x."""
    try:
        rows = dg.iloc[dg.index.get_loc(x)]
        # On an overlapping IntervalIndex, get_loc returns a boolean mask
        # and rows is a DataFrame; a unique match returns a scalar position
        # and rows is a single-row Series, which counts as 1.
        return len(rows) if rows.ndim == 2 else 1
    except KeyError:  # dates contained in no interval
        return 0

dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)

To illustrate the timings, look at your small set, then a much larger df.

%%timeit 
m = df[['date']].assign(k=1).merge(dg.assign(k=1))
(m[m.date.between(m.start, m.end)].groupby('date').size()
   .reindex(df.date).fillna(0)
   .rename('num_people_on_day').reset_index())
#9.39 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit 
dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)
#4.06 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But once df is longer (even just 90 rows), the difference becomes apparent.

df = pd.DataFrame({'date': pd.date_range('2007-01-01', '2007-03-31')})

%%timeit merge
#9.78 ms ± 75.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit IntervalIndex
#65.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
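
If the cross merge runs out of memory (as the comments report for ~50k rows over 1,200 days), here is a memory-light sketch: for each date d, the number of active ranges equals the number of starts on or before d minus the number of ends strictly before d, which two np.searchsorted calls against sorted arrays compute without materializing any product. This assumes every end is on or after its start.

import numpy as np

# Sketch: count = #(start <= d) - #(end < d) for each date d.
starts = np.sort(pd.to_datetime(dg['start']).to_numpy())
ends = np.sort(pd.to_datetime(dg['end']).to_numpy())
dates = pd.to_datetime(df['date']).to_numpy()

# side='right' counts starts <= d; side='left' counts ends < d.
df['num_people_on_day'] = (np.searchsorted(starts, dates, side='right')
                           - np.searchsorted(ends, dates, side='left'))

This runs in O((n + m) log n) time and O(n + m) memory, so it scales to the sizes the comments mention.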
ALollz
  • Agreed that the computation is efficient. However it is highly memory inefficient because of the merge. – bbd108 Jun 20 '19 at 15:46
  • @bbd108 yes I agree. There always seems to be the trade-off between time and memory with these types of calculations. – ALollz Jun 20 '19 at 15:47
data_df = pd.DataFrame()

# For one person, build a row for every date in their range and
# append it to the global data_df.
def adding_dates(x):
    dates = list(pd.date_range(x['start'], x['end']))
    data = pd.DataFrame({'date': dates})
    data['name'] = x['person']
    global data_df
    data_df = pd.concat([data_df, data], ignore_index=True)  # DataFrame.append is deprecated

dg.apply(lambda x: adding_dates(x), axis=1)

data_df['date'] = pd.to_datetime(data_df['date']).dt.date
df['date'] = pd.to_datetime(df['date']).dt.date
# Count names per date, then left-join onto df so every date in df survives;
# dates with nobody present come back as NaN (use .fillna(0) if 0s are wanted).
data_df = data_df.groupby(['date'], as_index=False)[['name']].count().rename(columns={'name': 'count'})

final_df = pd.merge(df[['date']], data_df, on=['date'], how='left')
print(final_df)


tawab_shakeel