1

Hi I am using the pandas/python and have a dataframe along the following lines:

21627   red
21627   green
21627   red
21627   blue
21627   purple
21628   yellow
21628   red
21628   green
21629   red
21629   red

Which I want to reduce to:

21627   red, green, blue, purple
21628   yellow, red, green
21629   red

Whats the best way of doing this (and collapsing all values in lists to unique values)?

Also, if I wanted to keep the redundancy:

21627   red, green, red, blue, purple
21628   yellow, red, green
21629   red, red

Whats the best way of achieving this?

Thanks in advance for any help.

user7289
  • 32,560
  • 28
  • 71
  • 88
  • answer [here](http://stackoverflow.com/questions/11391969/how-to-group-pandas-dataframe-entries-by-date-in-a-non-unique-column) and tutorial [here](http://wesmckinney.com/blog/?p=125) – lucasg Aug 22 '13 at 13:29

2 Answers2

7

If you really wanted to do this you could use a groupby apply:

In [11]: df.groupby('id').apply(lambda x: list(set(x['colours'])))
Out[11]: 
id
21627    [blue, purple, green, red]
21628          [green, red, yellow]
21629                         [red]
dtype: object

In [12]: df.groupby('id').apply(lambda x: list(x['colours']))
Out[12]: 
id
21627    [red, green, red, blue, purple]
21628               [yellow, red, green]
21629                         [red, red]
dtype: object

However, DataFrames containing lists are not particularly efficient.

Pivot table gets you a more useful DataFrame:

In [21]: df.pivot_table(rows='id', cols='colours', aggfunc=len, fill_value=0)
Out[21]: 
colours  blue  green  purple  red  yellow
id                                       
21627       1      1       1    2       0
21628       0      1       0    1       1
21629       0      0       0    2       0

My favourite function get_dummies lets you do it but not as elegantly or efficiently (but I'll keep this original, if crazy, suggestion):

In [22]: pd.get_dummies(df.set_index('id')['colours']).reset_index().groupby('id').sum()
Out[22]: 
       blue  green  purple  red  yellow
id                                     
21627     1      1       1    2       0
21628     0      1       0    1       1
21629     0      0       0    2       0
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • mabye add to cookbook these types of recipes – Jeff Aug 22 '13 at 14:04
  • Hey Andy, thanks - I am going to use the lists against each ID as a table to index in a search engine - hence wanted the lists of keywords against each ID – user7289 Aug 22 '13 at 17:39
2

Here's another way; Though @Andy's a bit more intuitve

In [24]: df.groupby('id').apply(
              lambda x: x['color'].value_counts()).unstack().fillna(0)
Out[24]: 
       blue  green  purple  red  yellow
id                                     
21627     1      1       1    2       0
21628     0      1       0    1       1
21629     0      0       0    2       0
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • That's not how you spell color :p, `value_counts` makes it more intuitive, but I think `pivot_table` is the way to do it. – Andy Hayden Aug 22 '13 at 14:10
  • 2
    I agree ``pivot_table`` is better; (this is basically what it does internally anyhow); I always though ``colour`` was a colloquial spelling (deprecated) :) – Jeff Aug 22 '13 at 14:15