Note: this question is related to an existing question here. However, mine provides a more concrete example and has broader impact.
Suppose we have a pandas DataFrame as follows:
   Questions  cnt  similarity
0        ABC    1   [1, 2, 3]
1        abc    2   [1, 2, 3]
2        cba    3   [2, 3, 1]
3       abcd    4   [4, 5, 6]
4       dcsa    5   [2, 3, 1]
5       adcd    6   [4, 5, 6]
6       abcd    7   [1, 2, 3]
7        cba    8   [7, 8, 9]
I need to add another column called cat, based on the similarity column. If two rows have the same similarity value, they should be assigned to the same group. The expected output is below. Any input is valuable. It is worth mentioning that the original dataset has 1M rows. Thank you.
   Questions  cnt  similarity  cat
0        ABC    1   [1, 2, 3]    1
1        abc    2   [1, 2, 3]    1
2        cba    3   [2, 3, 1]    2
3       abcd    4   [4, 5, 6]    3
4       dcsa    5   [2, 3, 1]    2
5       adcd    6   [4, 5, 6]    3
6       abcd    7   [1, 2, 3]    1
7        cba    8   [7, 8, 9]    4
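
For reproducibility, here is a minimal sketch of the setup together with one possible way to produce the cat column. It assumes the similarity column holds plain Python lists and that element order matters (so [1, 2, 3] and [2, 3, 1] are different groups, as in the expected output). Lists are unhashable, so the sketch converts each one to a tuple and then uses pd.factorize, which numbers values in order of first appearance. This is only one option and has not been benchmarked on the 1M-row dataset.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "Questions": ["ABC", "abc", "cba", "abcd", "dcsa", "adcd", "abcd", "cba"],
    "cnt": [1, 2, 3, 4, 5, 6, 7, 8],
    "similarity": [[1, 2, 3], [1, 2, 3], [2, 3, 1], [4, 5, 6],
                   [2, 3, 1], [4, 5, 6], [1, 2, 3], [7, 8, 9]],
})

# Lists cannot be hashed, so turn each list into a tuple first.
# (Assumption: order matters; sort the list before tuple() if it should not.)
keys = df["similarity"].map(tuple)

# factorize returns 0-based codes in order of first appearance;
# add 1 so the categories start at 1 as in the expected output.
codes, _ = pd.factorize(keys)
df["cat"] = codes + 1

print(df)
```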