
I have a df like this:

ID   Cluster Product 
 1         4     'b'  
 1         4     'f'
 1         4     'w'
 2         7     'u'
 2         7     'b'
 3         5     'h'
 3         5     'f'
 3         5     'm'
 3         5     'd'
 4         7     's'
 4         7     'b'
 4         7     'g'

Where ID is the primary, unique key of another df that is the source for this one. Cluster is not a key: different IDs often share the same Cluster value, but it is information I have to carry along.
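
For reference, here is a minimal sketch of how this sample df could be built (it assumes the quotes shown around the products are just display formatting, not part of the stored values):

import pandas as pd

# sample data copied from the table above
df = pd.DataFrame({
    'ID':      [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
    'Cluster': [4, 4, 4, 7, 7, 5, 5, 5, 5, 7, 7, 7],
    'Product': ['b', 'f', 'w', 'u', 'b', 'h', 'f', 'm', 'd', 's', 'b', 'g'],
})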

What I want to obtain is this dataframe:

ID   Cluster    Product_List_by_ID 
 1         4     ['b','f','w'] 
 2         7     ['u','b']
 3         5     ['h','f','m','d']
 4         7     ['s','b','g']

If this is not possible, a dictionary like this would also be fine:

d = {'ID': [1, 2, 3, 4], 'Cluster': [4, 7, 5, 7],
     'Product_List_by_ID': [['b','f','w'], ['u','b'], ['h','f','m','d'], ['s','b','g']]}

I have tried many ways without success; it seems that it is not possible to insert lists as pandas DataFrame values. Still, I think it should not be too hard to get there in some tricky way. Sorry if I am missing something, I am new to coding.

Any suggestions? Thanks

2 Answers

Use groupby and apply(list):

df.groupby(['ID', 'Cluster']).Product.apply(list)

ID  Cluster
1   4               ['b', 'f', 'w']
2   7                    ['u', 'b']
3   5          ['h', 'f', 'm', 'd']
4   7               ['s', 'b', 'g']
Name: Product, dtype: object
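
If you want the exact frame from the question, with a Product_List_by_ID column, one possible follow-up (a sketch, not part of the original answer) is to name the aggregated column while resetting the index:

result = (df.groupby(['ID', 'Cluster'])['Product']
            .apply(list)
            .reset_index(name='Product_List_by_ID'))
print(result)

reset_index(name=...) turns the grouped Series back into a DataFrame and labels the value column in a single step.
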
  • What operations can we do after groupby? Any good resource for that? – Mohammad Yusuf Dec 20 '16 at 15:38
  • [`groupby` docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html) – piRSquared Dec 20 '16 at 15:41
  • @MohammadYusufGhazi thx... commenting is difficult for me :-) – piRSquared Dec 20 '16 at 15:44
  • Great! I had not thought of ".apply(list)". Exactly what I needed! – mik.ferrucci Dec 20 '16 at 15:45
  • Have a look at my edits. I've noticed that the length of the groupby series is 17645, while there are 17668 different IDs. I am sure about those numbers; could the groupby have dropped something, or should I check my data more carefully? Thanks – mik.ferrucci Dec 20 '16 at 16:19
  • @piRSquared thanks very much for this. Is there a way to duplicate rather than drop? As in, keep the 'Cluster' and 'ID' columns untouched, but create a new column with the list repeated across the duplicated rows? – El_1988 Oct 07 '20 at 04:18

Another solution: first remove the ' characters from the Product column, if necessary, with str.strip:

df.Product = df.Product.str.strip("'")

Then groupby with apply; finally, if you need a dictionary, use to_dict with orient='list':

print (df.groupby(['ID', 'Cluster'])
         .Product.apply(lambda x: x.tolist())
         .reset_index()
         .to_dict(orient='list'))

{'Cluster': [4, 7, 5, 7], 
'ID': [1, 2, 3, 4], 
'Product': [['b', 'f', 'w'], ['u', 'b'], 
            ['h', 'f', 'm', 'd'], ['s', 'b', 'g']]}
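
If the dictionary keys should match the question exactly (Product_List_by_ID instead of Product), a rename before to_dict would do it (a sketch along the same lines):

d = (df.groupby(['ID', 'Cluster'])['Product']
       .apply(lambda x: x.tolist())
       .reset_index()
       .rename(columns={'Product': 'Product_List_by_ID'})
       .to_dict(orient='list'))
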
  • Very nice: now the second way I wondered about is explained too! Still, I can't figure out why the length of the lists in this dict is 17645 while the df has 17668 different IDs. – mik.ferrucci Dec 20 '16 at 16:32
  • Yes, I am thinking about it now. Is the data confidential? How do you test the length? – jezrael Dec 20 '16 at 16:33
  • You can check the length with `print (len(df.drop_duplicates(['ID', 'Cluster']).index))` - drop duplicates over the first and second columns and then check the length of the index. – jezrael Dec 20 '16 at 16:38
  • Ok, the data are confidential. I have just run your print removing duplicates: result = 17645, while with df['ID'].unique() I get that there are 17668 different IDs. – mik.ferrucci Dec 20 '16 at 16:48
  • I see the problem. You are checking it the wrong way: you need the unique values of both columns together, not only the `ID` column. – jezrael Dec 20 '16 at 16:51
  • In your sample the values in the two columns line up in the same positions, but in the real data they don't. – jezrael Dec 20 '16 at 16:54
  • You can test it with this sample: `np.random.seed(100)` `df = pd.DataFrame(np.random.randint(5, size=(20,3)), columns=['ID', 'Cluster', 'Product'])` `print (df)` `print (len(df.drop_duplicates(['ID', 'Cluster']).index))` `print (df.ID.nunique())` – jezrael Dec 20 '16 at 16:54
  • Yes, I was thinking the same. Now I have checked uniqueness over both columns: result = 17645. Evidently ID was not a "real" ID ;) Thanks! – mik.ferrucci Dec 20 '16 at 16:55
  • Thx for helping OP figure this out :-) – piRSquared Dec 20 '16 at 18:13
  • Thanks from me too for all the helpful comments. – jezrael Dec 20 '16 at 18:38