2

I have a dataframe that has a column that each value is a list, now I want to derive a new column which only considers list whose size is greater than 1, and assigns a unique integer to the corresponding row as id. A sample dataframe is like,

document_no_list    cluster_id
[1,2,3]             1
[4,5,6,7]           2
[8]                 nan
[9,10]              3 

column cluster_id only considers the 1st, 2nd and 4th row, each of which has a size greater than 1, and assigns a unique integer id to its corresponding cell in the column.

I am wondering how to do that in pandas.

daiyue
  • 7,196
  • 25
  • 82
  • 149
  • If you want to use the proper python `NaN`, the dtype for `cluster_id` cannot be `int`. Here's a [link](https://stackoverflow.com/questions/12708807/numpy-integer-nan) for more clarification. – Daniel Lenz Oct 25 '17 at 17:02

2 Answers2

2

We can use np.random.choice for unique random values with .loc for assignment i.e

df = pd.DataFrame({'document_no_list' :[[1,2,3],[4,5,6,7],[8],[9,10]]})

x = df['document_no_list'].apply(len) > 1 

df.loc[x,'Cluster'] =  np.random.choice(range(len(df)),x.sum(),replace=False)

Output :

 document_no_list  Cluster
0        [1, 2, 3]      2.0
1     [4, 5, 6, 7]      1.0
2              [8]      NaN
3          [9, 10]      3.0

If you want continuous numbers then you can use

df.loc[x,'Cluster'] =  np.arange(x.sum())+1
 document_no_list  Cluster
0        [1, 2, 3]      1.0
1     [4, 5, 6, 7]      2.0
2              [8]      NaN
3          [9, 10]      3.0

Hope it helps

Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
1

Create a boolean column based on condition and apply cumsum() on rows with 1's

df['cluster_id'] = df['document_no_list'].apply(lambda x: len(x)> 1).astype(int)

df.loc[df['cluster_id'] == 1, 'cluster_id'] = df.loc[df['cluster_id'] == 1, 'cluster_id'].cumsum()


document_no_list    cluster_id
0   [1, 2, 3]       1
1   [4, 5, 6, 7]    2
2   [8]             0
3   [9, 10]         3
Vaishali
  • 37,545
  • 5
  • 58
  • 86
  • what does `cumsum` do in this case? – daiyue Oct 26 '17 at 02:23
  • 1
    The cluster_id column contains 1s and 0s based on the condition True or False. Now if you filter this column to select only the rows with 1s and cumulatively sum them, you would get numbers from 1 incrementing by 1 :) – Vaishali Oct 26 '17 at 03:28