2

I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.

assume, this is my df:

df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))

   A  B  C
0  2  1  1
1  1  1  1
2  1  1  1
3  2  2  2
4  1  2  2
5  1  2  1
6  1  2  2
7  1  2  1
8  1  2  2
9  2  2  1

Now, I need to append a new column with an id for unique combinations. It has to be 0, it the combination occurs only once. In this case:

   A  B  C  unique_combination
0  2  1  1  0
1  1  1  1  1
2  1  1  1  1
3  2  2  2  0
4  1  2  2  2
5  1  2  1  3
6  1  2  2  2
7  1  2  1  3
8  1  2  2  2
9  2  2  1  0

My first approach was to use a for loop and check for every row, if I find more than one combination in the dataframe of the row's values with .query:

unique_combination = 1 #acts as a counter
df['unique_combination'] = 0    

for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check, if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1

However, I have no idea how to check whether there already is an ID assigned for a combination (see comment in code). Additionally my approach feels very slow and hacky (I have over 15000 rows). Do you data wrangler see a different approach to my problem?

Thank you very much!

zozo
  • 378
  • 3
  • 7
potofski
  • 101
  • 1
  • 3
  • 12

2 Answers2

4

Step1 : Assign a new column with values 0

df['new'] = 0

Step2 : Create a mask with repetition more than 1 i.e

mask = df.groupby(['A','B','C'])['new'].transform(lambda x : len(x)>1)

Step3 : Assign the values factorizing based on mask i.e

df.loc[mask,'new'] = df.loc[mask,['A','B','C']].astype(str).sum(1).factorize()[0] + 1

# or
# df.loc[mask,'new'] = df.loc[mask,['A','B','C']].groupby(['A','B','C']).ngroup()+1 

Output:

   A  B  C  new
0  2  1  1    0
1  1  1  1    1
2  1  1  1    1
3  2  2  2    0
4  1  2  2    2
5  1  2  1    3
6  1  2  2    2
7  1  2  1    3
8  1  2  2    2
9  2  2  1    0
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
4

A new feature added in Pandas version 0.20.2 creates a column of unique ids automatically for you.

df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()

gives the following output

    A   B   C   unique_id
0   2   1   2   3
1   2   2   1   4
2   1   2   1   1
3   1   2   2   2
4   1   1   1   0
5   1   2   1   1
6   1   1   1   0
7   2   2   2   5
8   1   2   2   2
9   1   2   2   2

The groups are given ids based on the order they would be iterated over.

See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups

zozo
  • 378
  • 3
  • 7