Suppose I have several records for each person, each with a date. I want to construct a column that indicates, per person, how many of that person's other records fall within the two months before the current record. That is, looking only at the records of, say, individual 'A', I loop over his/her records and, for each one, count the other records of individual 'A' that are dated earlier than the current record, but by less than two months.
Here is some test data to make it clearer:
import pandas as pd
import numpy as np

testdf = pd.DataFrame({
    'id_indiv': [1, 1, 1, 2, 2, 2],
    'id_record': [12, 13, 14, 19, 20, 23],
    'date': ['2017-04-28', '2017-04-05', '2017-08-05',
             '2016-02-01', '2016-02-05', '2017-10-05']})
testdf.date = pd.to_datetime(testdf.date)
I'll add the expected column of counts
testdf['expected'] = [1, 0, 0, 0, 1, 0]
#Gives:
date id_indiv id_record expected
0 2017-04-28 1 12 1
1 2017-04-05 1 13 0
2 2017-08-05 1 14 0
3 2016-02-01 2 19 0
4 2016-02-05 2 20 1
5 2017-10-05 2 23 0
My first thought was to group by id_indiv and then use apply or transform with a custom function. To make things easier, I'll first add a column that subtracts two months from the record date, and then I'll write the count_months custom function for the apply or transform:
testdf['2M_before'] = testdf['date'] - pd.Timedelta('{0}D'.format(30*2))
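As an aside, I'm approximating "two months" as 60 days here. If it should really mean calendar months, I assume something like pd.DateOffset would do instead (hypothetical alternative, not what I use below):

#testdf['2M_before'] = testdf['date'] - pd.DateOffset(months=2)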
def count_months(chunk, month_var='2M_before'):
    counts = np.empty(len(chunk))
    for i, (ind, row) in enumerate(chunk.iterrows()):
        #Count records dated before the current one,
        #but no more than two months earlier
        counts[i] = ((chunk.date > row[month_var])
                     & (chunk.date < row.date)).sum()
    return counts
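As a quick check, running it on the records of individual 1 alone gives the counts I expect:

count_months(testdf[testdf.id_indiv == 1])
#Gives [1., 0., 0.], matching the expected column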
I tried it first with transform:
testdf.groupby('id_indiv').transform(count_months)
but it gives AttributeError: ("'Series' object has no attribute 'iterrows'", 'occurred at index date'), which I guess means that transform passes the custom function a Series object (one column of the group at a time) rather than the whole sub-DataFrame, but I don't know how to fix that.
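To see what transform actually hands the function, I tried this little probe (just my own quick check):

def peek(col):
    #Show the type of whatever transform passes in
    print(type(col), getattr(col, 'name', None))
    return col

testdf.groupby('id_indiv').transform(peek)

As far as I can tell this prints Series objects, which is consistent with the error above: the function receives the columns of each group one at a time, so iterrows isn't available.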
Then I tried with apply:
testdf.groupby('id_indiv').apply(count_months)
#Gives
id_indiv
1 [1.0, 0.0, 0.0]
2 [0.0, 1.0, 0.0]
dtype: object
This almost works, but it gives the result as one list per group. To "unstack" those lists, I followed an answer to a similar question:
#First sort, just in case the order gets messed up when pasting back:
testdf = testdf.sort_values(['id_indiv', 'id_record'])
counts = (testdf.groupby('id_indiv').apply(count_months)
                .apply(pd.Series).stack()
                .reset_index(level=1, drop=True))
#Now create the new column
testdf.set_index('id_indiv', inplace=True)
testdf['mycount'] = counts.astype('int')
assert (testdf.expected == testdf.mycount).all()
#The df now looks like this:
date id_record expected 2M_before mycount
id_indiv
1 2017-04-28 12 1 2017-02-27 1
1 2017-04-05 13 0 2017-02-04 0
1 2017-08-05 14 0 2017-06-06 0
2 2016-02-01 19 0 2015-12-03 0
2 2016-02-05 20 1 2015-12-07 1
2 2017-10-05 23 0 2017-08-06 0
This seems to work, but it feels like there should be a much easier way (maybe using transform?). Besides, pasting the column back like that doesn't seem very robust.
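For reference, this is roughly the kind of one-step solution I was hoping for. I couldn't see how to make transform itself do it, but an apply that returns a Series indexed like the group seems close (just a sketch I put together, starting again from the frame as it was before the set_index step, and the names are my own):

df = testdf.reset_index()

def count_recent(chunk):
    #Count, for each record, the other records of the same individual
    #dated within the 60 days before it
    dates = chunk['date']
    return pd.Series([((dates > d - pd.Timedelta(days=60)) & (dates < d)).sum()
                      for d in dates],
                     index=chunk.index)

df['mycount2'] = df.groupby('id_indiv', group_keys=False).apply(count_recent)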
Thanks for your time!