getting the unique values of every column in a pandas dataframe - to help me create smaller more manageable dataframes to perform metrics on

Question

I started off wanting to turn a column of a pandas dataframe into a list and get its unique values, so that I could iterate over those values in a for loop and create a few smaller dataframes, i.e. one for each cluster. I then want to store these smaller dataframes in a dictionary.

@ben suggested I start a new question and ask how to use the groupby method of pandas dataframes to perform this task.

My original post is over here: get list from pandas dataframe column

My Data: 
cluster load_date   budget  actual  fixed_price
A   1/1/2014    1000    4000    Y
A   2/1/2014    12000   10000   Y
A   3/1/2014    36000   2000    Y
B   4/1/2014    15000   10000   N
B   4/1/2014    12000   11500   N
B   4/1/2014    90000   11000   N
C   7/1/2014    22000   18000   N
C   8/1/2014    30000   28960   N
C   9/1/2014    53000   51200   N

For example: for item in cluster_list (where cluster_list is the unique set of values in the cluster column):

create a dataframe for cluster A, where budget > X, etc.

Then do the same for the other clusters, and put them in a dictionary.

Then I want to be able to get a certain dataframe out of the dictionary, say only the dataframe for cluster B where budget > X:

def get_df(frames, key):
    # frames is the dictionary of dataframes, keyed by cluster
    return frames[key]

Thanks in advance

score 4 · Answer 1 · answered Mar 12 '14 at 05:26

There are two parts to this question. First, filter the rows where budget > X (here X = 10000):

In [11]: df1 = df[df['budget'] > 10000]

In [12]: df1
Out[12]:
  cluster load_date  budget  actual fixed_price
1       A  2/1/2014   12000   10000           Y
2       A  3/1/2014   36000    2000           Y
3       B  4/1/2014   15000   10000           N
4       B  4/1/2014   12000   11500           N
5       B  4/1/2014   90000   11000           N
6       C  7/1/2014   22000   18000           N
7       C  8/1/2014   30000   28960           N
8       C  9/1/2014   53000   51200           N

Now you can group by cluster and get the individual groups:

In [13]: g = df1.groupby('cluster')

In [14]: g.get_group('A')
Out[14]:
  cluster load_date  budget  actual fixed_price
1       A  2/1/2014   12000   10000           Y
2       A  3/1/2014   36000    2000           Y

Note: if you really want a dictionary then you can use:

In [15]: d = dict(iter(g))

In [16]: d['A']
Out[16]:
  cluster load_date  budget  actual fixed_price
1       A  2/1/2014   12000   10000           Y
2       A  3/1/2014   36000    2000           Y
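
For anyone who wants to run the two steps above end to end, here is a minimal sketch. It assumes the sample data from the question typed in by hand; the helper name frames_by_cluster and the min_budget parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster": list("AAABBBCCC"),
    "load_date": ["1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "4/1/2014",
                  "4/1/2014", "7/1/2014", "8/1/2014", "9/1/2014"],
    "budget": [1000, 12000, 36000, 15000, 12000, 90000, 22000, 30000, 53000],
    "actual": [4000, 10000, 2000, 10000, 11500, 11000, 18000, 28960, 51200],
    "fixed_price": ["Y", "Y", "Y", "N", "N", "N", "N", "N", "N"],
})


def frames_by_cluster(frame, min_budget):
    """Hypothetical helper: {cluster: sub-DataFrame} for rows with budget > min_budget."""
    filtered = frame[frame["budget"] > min_budget]
    # Iterating over a GroupBy yields (key, sub-DataFrame) pairs.
    return {key: group for key, group in filtered.groupby("cluster")}


frames = frames_by_cluster(df, 10000)
print(frames["B"])  # only the cluster B rows with budget > 10000
```

The dict comprehension does the same job as dict(iter(g)); it is just spelled out so the (key, group) pairs are visible.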

Would d = dict(g.get_group('A')) work as well for storing dictionaries? The reason I wanted to use a dictionary was because I wanted a different pandas dataframe for each group or cluster. I want to be able to run metrics over each of them separately, e.g. mean, max, etc. - yoshiserry, Mar 13 '14 at 09:36
You should be using groupby, and you can just do things like g.max(), g.sum() etc. - Andy Hayden, Mar 13 '14 at 16:26
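
To illustrate the suggestion in the last comment, here is a rough sketch of running per-cluster metrics straight from the groupby object, without building a dictionary first. It assumes the question's sample data, trimmed to the numeric columns, and the variable name summary is just for illustration. As far as I can tell, dict(g.get_group('A')) would map column names to Series for that one group, rather than give one dataframe per cluster.

```python
import pandas as pd

# Sample data from the question (only the columns needed for the metrics).
df = pd.DataFrame({
    "cluster": list("AAABBBCCC"),
    "budget": [1000, 12000, 36000, 15000, 12000, 90000, 22000, 30000, 53000],
    "actual": [4000, 10000, 2000, 10000, 11500, 11000, 18000, 28960, 51200],
})

# Filter first, then group; each aggregation is computed per cluster.
g = df[df["budget"] > 10000].groupby("cluster")

# One row per cluster, one (column, statistic) pair per output column.
summary = g[["budget", "actual"]].agg(["mean", "max"])
print(summary)
```

A single statistic works the same way, e.g. g["budget"].max() returns one value per cluster.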
