Suppose our problem can be simplified like this:
import pandas as pd

df = pd.DataFrame()
df['C_rows'] = ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1']
df['values'] = ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344, 'customer3', 4321, 4444, 5674, 'customer4']
which gives the table:
C_rows values
0 C1 customer1
1 C2 4321
2 C3 1266
3 C2 5671
4 C1 customer2
5 C2 123
6 C3 7344
7 C1 customer3
8 C2 4321
9 C3 4444
10 C4 5674
11 C1 customer4
How can we vectorise finding duplicate C_rows between each C1? For example, row 3 has a duplicate C2, which already occurs in row 1 of the same block. The dataset I am working with has 50,000 rows, with about 15 rows between each C1.
e.g. check duplicates like this:
C_rows values
0 C1 customer1
1 C2 4321
2 C3 1266
3 C2 5671
C2 is a duplicate
4 C1 customer2
5 C2 123
6 C3 7344
No duplicates
7 C1 customer3
8 C2 4321
9 C3 4444
10 C4 5674
No duplicates
without using for loops - i.e. quickly, in a vectorised way.
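One possible approach (a sketch, not necessarily the fastest): since every C1 starts a new customer block, a cumulative sum over the boolean mask `C_rows == 'C1'` labels each block with an integer id without any Python loop. Duplicates within a block are then just rows that are duplicated on the (block id, C_rows) pair. The column names `block` and `is_dup` below are illustrative, not from the question:

```python
import pandas as pd

df = pd.DataFrame()
df['C_rows'] = ['C1', 'C2', 'C3', 'C2', 'C1', 'C2', 'C3', 'C1', 'C2', 'C3', 'C4', 'C1']
df['values'] = ['customer1', 4321, 1266, 5671, 'customer2', 123, 7344, 'customer3', 4321, 4444, 5674, 'customer4']

# Each C1 opens a new customer block; cumsum over the mask gives a block id.
df['block'] = (df['C_rows'] == 'C1').cumsum()

# A row is a within-block duplicate if its (block, C_rows) pair occurs
# more than once; keep=False flags every occurrence, not just the later one.
df['is_dup'] = df.duplicated(subset=['block', 'C_rows'], keep=False)

print(df[df['is_dup']])  # rows 1 and 3: the duplicated C2 in the first block
```

Both `cumsum` and `duplicated` operate on whole columns, so this should scale to 50,000 rows without per-group Python loops. Per-block summaries (e.g. "does this customer have any duplicates?") could then be read off with `df.groupby('block')['is_dup'].any()`.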