
I have a df like this:

ID   Cluster Product 
 1         4     'b'  
 1         4     'f'
 1         4     'w'
 2         7     'u'
 2         7     'b'
 3         5     'h'
 3         5     'f'
 3         5     'm'
 3         5     'd'
 4         7     's'
 4         7     'b'
 4         7     'g'

Where ID is the primary, unique key of another df that is the source for this one. Cluster is not a key: different IDs often share the same Cluster value, but it is information I have to carry along.
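
For reference, here is a minimal sketch of how this sample df could be built (it assumes the quotes shown around the products are just display formatting, not part of the stored values):

import pandas as pd

# sample data copied from the table above
df = pd.DataFrame({
    'ID':      [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
    'Cluster': [4, 4, 4, 7, 7, 5, 5, 5, 5, 7, 7, 7],
    'Product': ['b', 'f', 'w', 'u', 'b', 'h', 'f', 'm', 'd', 's', 'b', 'g'],
})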

What I want to obtain is this dataframe:

ID   Cluster    Product_List_by_ID 
 1         4     ['b','f','w'] 
 2         7     ['u','b']
 3         5     ['h','f','m','d']
 4         7     ['s','b','g']

If this is not possible, a dictionary like this would also be fine:

d = {'ID': [1, 2, 3, 4], 'Cluster': [4, 7, 5, 7],
     'Product_List_by_ID': [['b','f','w'], ['u','b'], ['h','f','m','d'], ['s','b','g']]}

I have tried many ways without success; it seems that it is not possible to insert lists as pandas DataFrame values. Still, I think it should not be too hard to get there in some tricky way. Sorry if I am missing something, I am new to coding.

Any suggestions? Thanks

2 Answers

Use groupby and apply(list):

df.groupby(['ID', 'Cluster']).Product.apply(list)

ID  Cluster
1   4               ['b', 'f', 'w']
2   7                    ['u', 'b']
3   5          ['h', 'f', 'm', 'd']
4   7               ['s', 'b', 'g']
Name: Product, dtype: object
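
If you want the exact frame from the question, with a Product_List_by_ID column, one possible follow-up (a sketch, not part of the original answer) is to name the aggregated column while resetting the index:

result = (df.groupby(['ID', 'Cluster'])['Product']
            .apply(list)
            .reset_index(name='Product_List_by_ID'))
print(result)

reset_index(name=...) turns the grouped Series back into a DataFrame and labels the value column in a single step.
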
  • What operations can we do after groupby? Any good resource for that? – Mohammad Yusuf Dec 20 '16 at 15:38
  • [`groupby` docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html) – piRSquared Dec 20 '16 at 15:41
  • @MohammadYusufGhazi thx... commenting is difficult for me :-) – piRSquared Dec 20 '16 at 15:44
  • Great! I had not thought of ".apply(list)". Exactly what I needed! – mik.ferrucci Dec 20 '16 at 15:45
  • Have a look at my edits. I've noticed that the length of the groupby series is 17645, while there are 17668 different IDs. I am sure about those numbers; could the groupby have dropped something, or should I check my data more carefully? Thanks – mik.ferrucci Dec 20 '16 at 16:19
  • @piRSquared thanks very much for this. Is there a way to duplicate rather than drop? As in, keep the 'Cluster' and 'ID' columns untouched, but create a new column with the list repeated across the duplicated rows? – El_1988 Oct 07 '20 at 04:18

Another solution: first remove the ' characters from the Product column, if necessary, with str.strip:

df.Product = df.Product.str.strip("'")

Then groupby with apply; finally, if you need a dictionary, use to_dict with orient='list':

print (df.groupby(['ID', 'Cluster'])
         .Product.apply(lambda x: x.tolist())
         .reset_index()
         .to_dict(orient='list'))

{'Cluster': [4, 7, 5, 7], 
'ID': [1, 2, 3, 4], 
'Product': [['b', 'f', 'w'], ['u', 'b'], 
            ['h', 'f', 'm', 'd'], ['s', 'b', 'g']]}
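
If the dictionary keys should match the question exactly (Product_List_by_ID instead of Product), a rename before to_dict would do it (a sketch along the same lines):

d = (df.groupby(['ID', 'Cluster'])['Product']
       .apply(lambda x: x.tolist())
       .reset_index()
       .rename(columns={'Product': 'Product_List_by_ID'})
       .to_dict(orient='list'))
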
  • Very nice: now the second way I wondered about is explained too! Still, I can't figure out why the length of the lists in this dict is 17645 while the df has 17668 different IDs. – mik.ferrucci Dec 20 '16 at 16:32
  • Yes, I am thinking about it now. Is the data confidential? How do you test the length? – jezrael Dec 20 '16 at 16:33
  • You can check the length with `print (len(df.drop_duplicates(['ID', 'Cluster']).index))` - drop duplicates over the first and second columns and then check the length of the index. – jezrael Dec 20 '16 at 16:38
  • Ok, the data are confidential. I have just run your print removing duplicates: result = 17645, while with df['ID'].unique() I get that there are 17668 different IDs. – mik.ferrucci Dec 20 '16 at 16:48
  • I see the problem. You are checking it the wrong way: you need the unique values of both columns together, not only the `ID` column. – jezrael Dec 20 '16 at 16:51
  • In your sample the values in the two columns line up in the same positions, but in the real data they don't. – jezrael Dec 20 '16 at 16:54
  • You can test it with this sample: `np.random.seed(100)` `df = pd.DataFrame(np.random.randint(5, size=(20,3)), columns=['ID', 'Cluster', 'Product'])` `print (df)` `print (len(df.drop_duplicates(['ID', 'Cluster']).index))` `print (df.ID.nunique())` – jezrael Dec 20 '16 at 16:54
  • Yes, I was thinking the same. Now I have checked uniqueness over both columns: result = 17645. Evidently ID was not a "real" ID ;) Thanks! – mik.ferrucci Dec 20 '16 at 16:55
  • Thx for helping OP figure this out :-) – piRSquared Dec 20 '16 at 18:13
  • Thanks from me too for all the helpful comments. – jezrael Dec 20 '16 at 18:38