3

I have a dataset like below and i need all the different weights for each category in single row and the count

Sample_data

  category  weights
1  aa        3.2
2  aa        2.2
3  aa        4.2
4  bb        3.5
5  bb        4.5
6  aa        0.5
7  cc        0.6
8  bb        7.5
9  cc        6.6
10 dd        2.2
11 aa        3.3
12 bb        4.4
13 cc        5.5
14 dd        6.6

And what i need is the count of each unique category and the different weights of each category in the same row.

Expected output:

 category count  weight1  weight2  weight3  weight4  weight5   
1 aa      5      3.2      2.2      4.2      0.5      3.3
2 bb      4      3.5      4.5      7.5      4.4
3 cc      3      0.6      6.6      5.5
4 dd      2      2.2      6.6

I thought

sampledata['category'].groupby(level = 0)   

will work but it is not. Can some one help me how to do this in python.

Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504

3 Answers3

2

I could probably shorten this but the following works:

In [51]:

cat = df.groupby('category')['weights'].agg({'count':'count', 'weight_cat':lambda x: list(x)}).reset_index()
cat
Out[51]:
  category  count                 weight_cat
0       aa      5  [3.2, 2.2, 4.2, 0.5, 3.3]
1       bb      4       [3.5, 4.5, 7.5, 4.4]
2       cc      3            [0.6, 6.6, 5.5]
3       dd      2                 [2.2, 6.6]
In [52]:

cat = cat.join(cat['weight_cat'].apply(lambda x: pd.Series(x)))
cat
Out[52]:
  category  count                 weight_cat    0    1    2    3    4
0       aa      5  [3.2, 2.2, 4.2, 0.5, 3.3]  3.2  2.2  4.2  0.5  3.3
1       bb      4       [3.5, 4.5, 7.5, 4.4]  3.5  4.5  7.5  4.4  NaN
2       cc      3            [0.6, 6.6, 5.5]  0.6  6.6  5.5  NaN  NaN
3       dd      2                 [2.2, 6.6]  2.2  6.6  NaN  NaN  NaN
In [68]:

rename_cols = [col for col in cat if type(col) == int]
rename_weight_cols = ['weight'+str(col + 1) for col in rename_cols]
d = dict(zip(rename_cols, rename_weight_cols))
cat.rename(columns = d,inplace=True)
cat
Out[68]:
  category  count                 weight_cat  weight1  weight2  weight3  \
0       aa      5  [3.2, 2.2, 4.2, 0.5, 3.3]      3.2      2.2      4.2   
1       bb      4       [3.5, 4.5, 7.5, 4.4]      3.5      4.5      7.5   
2       cc      3            [0.6, 6.6, 5.5]      0.6      6.6      5.5   
3       dd      2                 [2.2, 6.6]      2.2      6.6      NaN   

   weight4  weight5  
0      0.5      3.3  
1      4.4      NaN  
2      NaN      NaN  
3      NaN      NaN 

So what the above does is first group on the 'category' column and perform an aggregation on the weight column, we create a count column and then we turn all the values for that group into a list and add this.

I then call apply on that list to turn it into a Series, this will auto generate the names of the columns 0..4.

I then create a dict to rename the columns to weight1 through to 5 as desired.

EdChum
  • 376,765
  • 198
  • 813
  • 562
1

You could also use unstack after resetting the index of each group:

dfw = df.groupby(['category'])['weights'].apply(lambda i: i.reset_index(drop=True)).unstack(level=1)

The size (here the 'count') of each group is df.groupby(['category']).size().

dfw.rename(columns=lambda x: 'weight'+ str(x+1), inplace=True) # string concatenation to give column labels
dfw.insert(0, 'count', df.groupby(['category']).size()) #insert count at position 0

This yields:

          count  weight1  weight2  weight3  weight4  weight5
category                                                    
aa            5      3.2      2.2      4.2      0.5      3.3
bb            4      3.5      4.5      7.5      4.4      NaN
cc            3      0.6      6.6      5.5      NaN      NaN
dd            2      2.2      6.6      NaN      NaN      NaN
Vidhya G
  • 2,250
  • 1
  • 25
  • 28
0

Given that your sample data is in the form of a list of dictionaries called data, where each dictionary has a category key and a weight key, then the following code will provide you with what you need:

trans_data = {}
for item in data:
    if item['category'] in trans_data:
        trans_data[item['category']]['count'] += 1
        trans_data[item['category']]['weights'].append(item['weight'])
    else:
        trans_data[item['category']] = {'count': 1, 'weights': [item['weight'],]}

The data data structure is assumed to be of the following form:

data = [{'category': 'aa', 'weight': 3.2}, {'category': 'bb', 'weight': 2.2}, {'category': 'aa', 'weight': 1.1}]
OrenD
  • 1,751
  • 13
  • 12
  • Op added that the data is a `pandas` data type. Anyway, a list of `dict`? Rather one `dict` with keys `category` and `weights`. – Callahan Jun 01 '15 at 08:20
  • @OrenD tried your code and it is producing the following error TypeError: string indices must be integers, not str. i didn't understand how to deal with. – Surendra babu Pasumarhti Jun 01 '15 at 08:26
  • @Callahan, saw the clarifications re: panda after the fact.. The suggested sampled data does not make sense being represented in a single dict. – OrenD Jun 01 '15 at 08:30
  • @ashok, I updated the answer with how `data` should look like so you should get that error. If you're looking for a panda based solution, then this is not it and I will delete the answer.. – OrenD Jun 01 '15 at 08:31