I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.
assume, this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for unique combinations. It has to be 0, it the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check for every row, if I find more than one combination in the dataframe of the row's values with .query:
unique_combination = 1 #acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
# check, if one occurrence of the combination already has a value > 0???
df.loc[idx, 'unique_combination'] = unique_combination
unique_combination += 1
However, I have no idea how to check whether there already is an ID assigned for a combination (see comment in code). Additionally my approach feels very slow and hacky (I have over 15000 rows). Do you data wrangler see a different approach to my problem?
Thank you very much!