pandas how to derived values for a new column base on another column

Question

I have a dataframe that has a column that each value is a list, now I want to derive a new column which only considers list whose size is greater than 1, and assigns a unique integer to the corresponding row as id. A sample dataframe is like,

document_no_list    cluster_id
[1,2,3]             1
[4,5,6,7]           2
[8]                 nan
[9,10]              3

column cluster_id only considers the 1st, 2nd and 4th row, each of which has a size greater than 1, and assigns a unique integer id to its corresponding cell in the column.

I am wondering how to do that in pandas.

If you want to use the proper python `NaN`, the dtype for `cluster_id` cannot be `int`. Here's a [link](https://stackoverflow.com/questions/12708807/numpy-integer-nan) for more clarification. — Daniel Lenz, Oct 25 '17 at 17:02

Bharath M Shetty · Accepted Answer · 2017-10-25T16:59:56.783

2

We can use np.random.choice for unique random values with .loc for assignment i.e

df = pd.DataFrame({'document_no_list' :[[1,2,3],[4,5,6,7],[8],[9,10]]})

x = df['document_no_list'].apply(len) > 1 

df.loc[x,'Cluster'] =  np.random.choice(range(len(df)),x.sum(),replace=False)

Output :

 document_no_list  Cluster
0        [1, 2, 3]      2.0
1     [4, 5, 6, 7]      1.0
2              [8]      NaN
3          [9, 10]      3.0

If you want continuous numbers then you can use

df.loc[x,'Cluster'] =  np.arange(x.sum())+1

 document_no_list  Cluster
0        [1, 2, 3]      1.0
1     [4, 5, 6, 7]      2.0
2              [8]      NaN
3          [9, 10]      3.0

Hope it helps

edited Oct 25 '17 at 16:59

answered Oct 25 '17 at 16:50

Bharath M Shetty

30,075
6
57
108

1

Great answer! Could use x.sum() instead of len(x[x]), might be a bit clearer. – Daniel Lenz Oct 25 '17 at 16:58
Thanks @DanielLenz thats really nice – Bharath M Shetty Oct 25 '17 at 17:00

score 1 · Answer 2 · answered Oct 25 '17 at 17:39

1

Create a boolean column based on condition and apply cumsum() on rows with 1's

df['cluster_id'] = df['document_no_list'].apply(lambda x: len(x)> 1).astype(int)

df.loc[df['cluster_id'] == 1, 'cluster_id'] = df.loc[df['cluster_id'] == 1, 'cluster_id'].cumsum()


document_no_list    cluster_id
0   [1, 2, 3]       1
1   [4, 5, 6, 7]    2
2   [8]             0
3   [9, 10]         3

answered Oct 25 '17 at 17:39

Vaishali

37,545
5
58
86

what does `cumsum` do in this case? – daiyue Oct 26 '17 at 02:23
1

The cluster_id column contains 1s and 0s based on the condition True or False. Now if you filter this column to select only the rows with 1s and cumulatively sum them, you would get numbers from 1 incrementing by 1 :) – Vaishali Oct 26 '17 at 03:28

pandas how to derived values for a new column base on another column

2 Answers2

Linked