I have a problem related to this question: Aggregate column values in pandas GroupBy as a dict
My input data has the following columns; for instance, the input might look like this:
language, product_id, shelf_id, rank, fiction, pages
English, 742005, 4560, 10.2, 1.0, 456
English, 6000075389352, 4560, 49, 1.0, 234
French, 899883993, 4560, 32, 0.0, 125
French, 731317391, 7868, 81, 1.0, 576
French, 235678655, 7868, 12, 1.0, 235
I would like to do a groupby on the language and shelf_id columns and get the list of the remaining attributes for each product_id. The expected output should have the following format for each grouped record:
language, shelf_id, {product_id: [rank, fiction, pages]}
For the given input, my DESIRED OUTPUT would be the following:
language, shelf_id, mapping
English, 4560, {742005: [10.2, 1.0, 456], 6000075389352: [49, 1.0, 234]}
French, 4560, {899883993: [32, 0.0, 125]}
French, 7868, {731317391: [81, 1.0, 576], 235678655: [12, 1.0, 235]}
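To make the mapping column concrete, this is how I would build the dictionary for the (English, 4560) group by hand in plain Python, using the sample rows above:

# mapping for the (English, 4560) group, built by hand from the sample rows
product_ids = [742005, 6000075389352]
attributes = [[10.2, 1.0, 456], [49, 1.0, 234]]   # [rank, fiction, pages] per product
mapping = dict(zip(product_ids, attributes))
# -> {742005: [10.2, 1.0, 456], 6000075389352: [49, 1.0, 234]}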
The solution provided in the above-mentioned post solves the problem nicely when only one column needs to go into the resulting dictionary:
import pandas as pd

# here file.csv contains only the four columns named below
df = pd.read_csv('file.csv', header=None)
df.columns = ['Lang', 'product_id', 'shelf_id', 'rank_id']

# build {product_id: rank_id} for each (Lang, shelf_id) group
(df.groupby(['Lang', 'shelf_id'], as_index=False)
 .apply(lambda x: dict(zip(x['product_id'], x['rank_id'])))
 .reset_index(name='mapping'))
which would yield:
Lang shelf_id mapping
0 English 4560 {742005: 10.2, 6000075389352: 49.0}
1 French 4560 {899883993: 32.0}
2 French 7868 {731317391: 81.0, 235678655: 12.0}
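Based on that pattern, the direction I have been considering is to zip product_id against the row values of the remaining columns. Below is an untested sketch; the column subset ['rank', 'fiction', 'pages'] is just my guess at how to select those values, and I am not sure this is correct or idiomatic:

import pandas as pd

df = pd.read_csv('file.csv', header=None)
df.columns = ['language', 'product_id', 'shelf_id', 'rank', 'fiction', 'pages']

# for each (language, shelf_id) group, map product_id -> [rank, fiction, pages]
out = (df.groupby(['language', 'shelf_id'])
         .apply(lambda g: dict(zip(g['product_id'],
                                   g[['rank', 'fiction', 'pages']].values.tolist())))
         .reset_index(name='mapping'))

I am also unsure whether the integer columns (e.g. pages) would survive the .values.tolist() conversion without being cast to floats.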
Could anyone help me adapt this solution to my case? Any suggestions would be appreciated.