A complicated operation on a pandas DataFrame

Question

I would like to obtain, for each id_1, the number of common id_2s between the months of December and January. The example below:

id_1	id_2	Date
12	1	20221216
12	1	20230113
12	1	20230116
12	2	20221213
12	2	20230118
18	7	20221207
18	7	20220907
18	7	20230113
18	5	20230118

should return :

id_1	Nb
12	2
18	1

I have looked for a short and optimal way to do this but I haven't found it. I would like to avoid doing several merges. Would you have ideas?

Why is for `18` in ouput `1` ? – jezrael Feb 27 '23 at 13:48 — jezrael, Feb 27 '23 at 13:48

Corralien · Answer 1 · 2023-02-27T13:57:03.040

1

You have to filter your dataframe before counting values:

m = df['Date'].astype(int).between(20221201, 20230131)
counts = df[m].groupby('id_1', as_index=False)['id_2'].nunique()

Or using DatetimeIndex:

m = pd.to_datetime(df['Date'], format='%Y%m%d').between('2022-12-01', '2023-01-31')
counts = df[m].groupby('id_1', as_index=False)['id_2'].nunique()

Output:

>>> counts
   id_1  id_2
0    12     2
1    18     2

Suggested by @mozway:

counts = df[m].groupby('id_1', as_index=False).agg(Nb=('id_2', 'nunique'))

edited Feb 27 '23 at 13:57

answered Feb 27 '23 at 13:45

Corralien

109,409
8
28
52

If you want to rename `id_2` column, use `counts = counts.rename(columns={'id_2': 'Nb'})` – Corralien Feb 27 '23 at 13:51
As I said in my request the expected result for the id_1 == 18 is supposed to be 1, not 2 because for its 2nd id_2 (5) there is no line in december. – GRX Feb 27 '23 at 13:57

mozway · Accepted Answer · 2023-02-27T14:08:03.597

IIUC, you can extract the month with to_datetime and dt.month, then compute a pivot_table with a custom function (are both months present for a given combination), then sum to count:

(df.assign(month=pd.to_datetime(df['Date'], format='%Y%m%d').dt.month)
   .pivot_table(index='id_1', columns='id_2', values='month',
                aggfunc=lambda x: set(x)<={1, 12})
   .sum(axis=1).reset_index(name='Nb')
)

Or with a crosstab:

pd.crosstab(df['id_1'], df['id_2'],
            values=pd.to_datetime(df['Date'], format='%Y%m%d').dt.month,
            aggfunc=lambda x: set(x)<={1, 12}
           ).sum(axis=1).reset_index(name='Nb')

Output:

   id_1   Nb
0    12  2.0
1    18  1.0

Intermediate before the sum:

id_2     1     2     5      7
id_1                         
12    True  True   NaN    NaN
18     NaN   NaN  True  False

Or using a double groupby:

(df.assign(month=pd.to_datetime(df['Date'], format='%Y%m%d').dt.month)
   .groupby(['id_1', 'id_2'])['month'].agg(lambda x: set(x)<={1, 12})
   .groupby('id_1').sum().reset_index(name='Nb')
)

A complicated operation on a pandas DataFrame

2 Answers2