I am working with a dataframe in which each value of one column is a list. I want to derive a new column that assigns a unique integer id to each row whose list has more than one element; rows whose list has a single element get 0. If two lists contain the same elements in a different order, they should be assigned the same id. A sample dataframe looks like this:

document_no_list    cluster_id
[1,2,3]             1
[3,2,1]             1
[4,5,6,7]           2
[8]                 0
[9,10]              3
[10,9]              3 

The cluster_id column only considers the 1st, 2nd, 3rd, 5th and 6th rows, each of which has a list with more than one element, and assigns a unique integer id to the corresponding cell. [1,2,3] and [3,2,1] should share a cluster_id, as should [9,10] and [10,9].

I previously asked a similar question, which did not consider lists that are duplicates of each other, at

pandas how to derived values for a new column base on another column

I am wondering how to do that in pandas.

daiyue

1 Answer

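First, add a column with the list lengths, and another column with each list sorted and converted to a tuple. Lists and sets are unhashable, so pandas cannot deduplicate or merge on them directly; a sorted tuple is hashable and ignores the original order:

df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))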
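
As a small illustration of why the key matters, here is a sketch (the helper name `canonical_key` is hypothetical, not from the answer) showing that sorting a list and converting it to a tuple gives two differently ordered lists the same hashable key, which is what deduplication relies on:

```python
# Hypothetical helper for illustration; the name is not from the original answer.
def canonical_key(values):
    """Return an order-insensitive, hashable key for a list."""
    return tuple(sorted(values))


key_a = canonical_key([1, 2, 3])
key_b = canonical_key([3, 2, 1])

print(key_a == key_b)        # True: same elements, different order
print(len({key_a, key_b}))   # 1: tuples are hashable, so a set collapses them
```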

Then assign a cluster_id to each distinct sorted value whose list has more than one element:

ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1,len(ids)+1)

Left-join this onto the original dataframe, and fill the rows that found no match (the singletons) with zeros:

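df = df.merge(ids, how='left', on='list_sorted').fillna({'cluster_id': 0})
df['cluster_id'] = df['cluster_id'].astype(int)  # the left join leaves floats because of NaN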
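
Putting the steps above together, the following is a minimal end-to-end sketch run on the sample data from the question. It assumes the sorted lists are stored as tuples so that they are hashable, and the variable name `result` is chosen here for illustration only:

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    'document_no_list': [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7], [8], [9, 10], [10, 9]]
})

# Step 1: list length and an order-insensitive, hashable key per row.
df['list_len'] = df['document_no_list'].apply(len)
df['list_sorted'] = df['document_no_list'].apply(lambda x: tuple(sorted(x)))

# Step 2: one id per distinct key, considering only lists longer than 1.
ids = df.loc[df['list_len'] > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1, len(ids) + 1)

# Step 3: left-join the ids back; unmatched rows (singletons) get 0.
result = df.merge(ids, how='left', on='list_sorted').fillna({'cluster_id': 0})
result['cluster_id'] = result['cluster_id'].astype(int)

print(result['cluster_id'].tolist())  # [1, 1, 2, 0, 3, 3]
```

The ids follow the order in which each distinct key first appears, which matches the sample output in the question.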
Ken Wei
  • `TypeError: unhashable type: 'set'` – daiyue Oct 26 '17 at 11:38
  • Sorry, I didn't actually try running that code. Apparently pandas runs into trouble dealing with sets, so you could sort the lists instead and compare them. – Ken Wei Oct 26 '17 at 15:43