
I have been searching the web for a simple way, using Python/pandas, to build a dataframe that holds only the unique rows of an original dataframe together with their basic stats (occurrences, mean, and so on).

So far I have only got halfway: I found how to get all the unique rows using

 data.drop_duplicates()

But then I'm not quite sure how to easily retrieve all the stats I want from there. I could loop over a groupby (sketched below, after the data example), but that would be rather slow.

Another approach that I thought of was using the groupby and then use describe, e.g.,

data.groupby(allColumns)[columnImInterestedInForStats].describe()

But it turns out that, with 19 columns in allColumns, this returns me only one row with no stats at all. Surprisingly, if I pick only a small subset of allColumns, I do get each unique combination of the subset together with all its stats. My expectation was that passing all 19 columns to groupby() would give me all unique groups?
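
As a minimal, self-contained sketch of the two-column case that does behave as expected (toy data and names of my own, not the real 19-column set):

import pandas as pd

toy = pd.DataFrame({'g1': ['x', 'x', 'y'],
                    'g2': [1, 1, 2],
                    'val': [1.0, 2.0, 3.0]})

# one stats row per unique (g1, g2) combination, as expected
print(toy.groupby(['g1', 'g2'])['val'].describe())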

Data example:

import pandas as pd

df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4, 3.4, 2.6, 1.1, 1.1, 3.3],
                   list('AAABBBBABCBDDD'),
                   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1']]).T
df.columns = ['col1', 'col2', 'col3']
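
The slow loop I referred to above would look roughly like this against the example (my own sketch, not code I recommend):

# slow: walk the groups one by one and collect the stats by hand
rows = []
for key, group in df.groupby(['col2', 'col3']):
    rows.append(key + (group['col1'].astype(float).mean(), len(group)))
print(rows)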

Desired result (as a dataframe; means computed from the example data):

col2 col3  mean  count and so on
A    1     1.10    1
     3     1.60    3
B    2     3.00    2
     4     2.50    1
     5     2.60    2
     6     3.40    1
C    3     3.40    1
D    1     1.83    3

I'm sure it must be something very trivial that I'm missing, but I can't find the proper answer. Thanks in advance.

  • I found that when I try many columns in the groupby, say 14, only a small portion of the data actually seems to be used and reported. E.g., a dataset of 2066 rows reports only the last 50 rows, as if the first 2000 don't even exist. – Ben Haanstra Aug 06 '15 at 21:45

1 Answer


You can achieve the desired effect using agg().

import pandas as pd

df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4, 3.4, 2.6, 1.1, 1.1, 3.3],
                   list('AAABBBBABCBDDD'),
                   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1']]).T
df.columns = ['col1', 'col2', 'col3']
df['col1'] = df['col1'].astype(float)  # .T leaves every column as object dtype

# one row per unique (col2, col3) pair, one column per requested stat of col1
df.groupby(['col2', 'col3'])['col1'].agg(['mean', 'count', 'max', 'min', 'median'])

In place of 'col1' after df.groupby(...) you can pass a list of the columns you are interested in.
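
For instance, a sketch that aggregates two columns at once (col4 is a hypothetical second numeric column, added here only for illustration):

# hypothetical extra numeric column, just to have two to aggregate
df['col4'] = range(len(df))

# one (column, stat) pair per output column, per unique (col2, col3) group
df.groupby(['col2', 'col3'])[['col1', 'col4']].agg(['mean', 'count'])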

zuku
  • Thank you for your response. The code that you showed is precisely what I have tried, but with many columns (14+) it seems not to work. It will simply forget all the groups except the last, as if there is a limit. I have not found any info online about such a limitation. – Ben Haanstra Aug 08 '15 at 08:25
  • I guess you have some missing values in these 14+ columns. Pandas groupby omits rows whose group keys contain missing values. You can use `df.fillna()` on the columns you group by; see the sketch after these comments and [link](http://stackoverflow.com/questions/31866781/pandas-groupby-ignoring-nas/31872433#31872433) – zuku Aug 08 '15 at 20:49
  • I think that was the cause! Thank you very much for pointing this gimmick out! – Ben Haanstra Aug 09 '15 at 15:35
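
To illustrate the point from the comments: a minimal sketch (my own toy data) of how a NaN in a grouping column silently drops rows, and how fillna() brings them back:

import pandas as pd
import numpy as np

d = pd.DataFrame({'key': ['A', 'A', np.nan, 'B'],
                  'val': [1.0, 2.0, 3.0, 4.0]})

# the NaN row vanishes: only groups A and B show up in the result
print(d.groupby('key')['val'].agg(['mean', 'count']))

# filling the missing key first keeps every row in some group
print(d.fillna({'key': 'missing'}).groupby('key')['val'].agg(['mean', 'count']))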