
Note: this question is related to an existing question here. However, mine provides a more concrete example and has broader impact.

Suppose we have a pandas DataFrame like the following:

  Questions  cnt similarity
0       ABC    1  [1, 2, 3]
1       abc    2  [1, 2, 3]
2       cba    3  [2, 3, 1]
3      abcd    4  [4, 5, 6]
4      dcsa    5  [2, 3, 1]
5      adcd    6  [4, 5, 6]
6      abcd    7  [1, 2, 3]
7       cba    8  [7, 8, 9]

I need to add another column, cat, based on the similarity column: rows with the same similarity value should be assigned to the same group. The expected output is below. Note that the original dataset has 1M rows. Any input is appreciated. Thank you.

  Questions  cnt similarity  cat
0       ABC    1  [1, 2, 3]    1
1       abc    2  [1, 2, 3]    1
2       cba    3  [2, 3, 1]    2
3      abcd    4  [4, 5, 6]    3
4      dcsa    5  [2, 3, 1]    2
5      adcd    6  [4, 5, 6]    3
6      abcd    7  [1, 2, 3]    1
7       cba    8  [7, 8, 9]    4

Sophia

2 Answers


IIUC, you can use pd.factorize:

df["cat"] = pd.factorize(df["similarity"].astype(str))[0] + 1

Output:

print(df)

  Questions  cnt similarity  cat
0       ABC    1  [1, 2, 3]    1
1       abc    2  [1, 2, 3]    1
2       cba    3  [2, 3, 1]    2
3      abcd    4  [4, 5, 6]    3
4      dcsa    5  [2, 3, 1]    2
5      adcd    6  [4, 5, 6]    3
6      abcd    7  [1, 2, 3]    1
7       cba    8  [7, 8, 9]    4
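
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the example frame and applies the same pd.factorize call. It assumes the similarity column holds actual Python lists (the question does not say whether they are lists or strings); the names codes and uniques are just illustrative. Casting to str gives factorize something hashable, and the codes come back in order of first appearance.

```python
import pandas as pd

# Rebuild the example frame from the question.
# Assumption: "similarity" holds real Python lists, not strings.
df = pd.DataFrame({
    "Questions": ["ABC", "abc", "cba", "abcd", "dcsa", "adcd", "abcd", "cba"],
    "cnt": [1, 2, 3, 4, 5, 6, 7, 8],
    "similarity": [[1, 2, 3], [1, 2, 3], [2, 3, 1], [4, 5, 6],
                   [2, 3, 1], [4, 5, 6], [1, 2, 3], [7, 8, 9]],
})

# Lists are not hashable, so factorize their string form instead.
# factorize returns (codes, uniques); codes start at 0 and follow
# the order in which each distinct value first appears.
codes, uniques = pd.factorize(df["similarity"].astype(str))
df["cat"] = codes + 1  # shift so that categories start at 1

print(df)
```

Note that [1, 2, 3] and [2, 3, 1] end up in different groups here, which matches the expected output in the question.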
Timeless

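One way is to use groupby.ngroup(). If the similarity column holds actual Python lists, they are not hashable and cannot be used as group keys directly, so map each list to a tuple first. Passing sort=False numbers the groups in order of first appearance:

df['cat'] = df.groupby(df['similarity'].map(tuple), sort=False).ngroup() + 1
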
  Questions  cnt similarity  cat
0       ABC    1  [1, 2, 3]    1
1       abc    2  [1, 2, 3]    1
2       cba    3  [2, 3, 1]    2
3      abcd    4  [4, 5, 6]    3
4      dcsa    5  [2, 3, 1]    2
5      adcd    6  [4, 5, 6]    3
6      abcd    7  [1, 2, 3]    1
7       cba    8  [7, 8, 9]    4
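
A self-contained sketch of the groupby.ngroup() route, again assuming the similarity column contains Python lists. Since lists cannot be hashed as group keys, the sketch groups on a tuple version of the column (the helper name key is hypothetical), and sort=False is used so that group numbers follow first appearance rather than sorted key order.

```python
import pandas as pd

# Same example frame as in the question.
# Assumption: "similarity" holds real Python lists.
df = pd.DataFrame({
    "Questions": ["ABC", "abc", "cba", "abcd", "dcsa", "adcd", "abcd", "cba"],
    "cnt": [1, 2, 3, 4, 5, 6, 7, 8],
    "similarity": [[1, 2, 3], [1, 2, 3], [2, 3, 1], [4, 5, 6],
                   [2, 3, 1], [4, 5, 6], [1, 2, 3], [7, 8, 9]],
})

# Tuples are hashable, so they can serve as group keys.
key = df["similarity"].map(tuple)

# ngroup() labels each row with its group's number (starting at 0);
# sort=False keeps the numbering in order of first appearance.
df["cat"] = df.groupby(key, sort=False).ngroup() + 1

print(df)
```

With the default sort=True, the numbering follows the sorted order of the keys instead, which happens to give the same result for this sample but may not for other data.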
Stu Sztukowski