0

I would like to obtain, for each id_1, the number of common id_2s between the months of December and January. The example below:

id_1 id_2 Date
12 1 20221216
12 1 20230113
12 1 20230116
12 2 20221213
12 2 20230118
18 7 20221207
18 7 20220907
18 7 20230113
18 5 20230118

should return :

id_1 Nb
12 2
18 1

I have looked for a short and optimal way to do this but I haven't found it. I would like to avoid doing several merges. Would you have ideas?

GRX
  • 13
  • 2

2 Answers2

1

You have to filter your dataframe before counting values:

m = df['Date'].astype(int).between(20221201, 20230131)
counts = df[m].groupby('id_1', as_index=False)['id_2'].nunique()

Or using DatetimeIndex:

m = pd.to_datetime(df['Date'], format='%Y%m%d').between('2022-12-01', '2023-01-31')
counts = df[m].groupby('id_1', as_index=False)['id_2'].nunique()

Output:

>>> counts
   id_1  id_2
0    12     2
1    18     2

Suggested by @mozway:

counts = df[m].groupby('id_1', as_index=False).agg(Nb=('id_2', 'nunique'))
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • If you want to rename `id_2` column, use `counts = counts.rename(columns={'id_2': 'Nb'})` – Corralien Feb 27 '23 at 13:51
  • As I said in my request the expected result for the id_1 == 18 is supposed to be 1, not 2 because for its 2nd id_2 (5) there is no line in december. – GRX Feb 27 '23 at 13:57
1

IIUC, you can extract the month with to_datetime and dt.month, then compute a pivot_table with a custom function (are both months present for a given combination), then sum to count:

(df.assign(month=pd.to_datetime(df['Date'], format='%Y%m%d').dt.month)
   .pivot_table(index='id_1', columns='id_2', values='month',
                aggfunc=lambda x: set(x)<={1, 12})
   .sum(axis=1).reset_index(name='Nb')
)

Or with a crosstab:

pd.crosstab(df['id_1'], df['id_2'],
            values=pd.to_datetime(df['Date'], format='%Y%m%d').dt.month,
            aggfunc=lambda x: set(x)<={1, 12}
           ).sum(axis=1).reset_index(name='Nb')

Output:

   id_1   Nb
0    12  2.0
1    18  1.0

Intermediate before the sum:

id_2     1     2     5      7
id_1                         
12    True  True   NaN    NaN
18     NaN   NaN  True  False

Or using a double groupby:

(df.assign(month=pd.to_datetime(df['Date'], format='%Y%m%d').dt.month)
   .groupby(['id_1', 'id_2'])['month'].agg(lambda x: set(x)<={1, 12})
   .groupby('id_1').sum().reset_index(name='Nb')
)
mozway
  • 194,879
  • 13
  • 39
  • 75