
In R/dplyr, I can do

summarise(iris, max_width=max(Sepal.Width), min_width=min(Sepal.Width))

and get:

  max_width min_width
1       4.4         2

Is there something similar to summarise in pandas? I know describe(), but I would like the result to only contain a given summary statistic for a given column, not all summary statistics for all columns. In pandas, iris.describe() gives:

        sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
jaweej
  • Dupe: http://stackoverflow.com/questions/22235245/calculate-summary-statistics-of-columns-in-dataframe/22235393#22235393 , basically [`describe`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) is the equivalent so in your case you can sub-select by passing a list of cols that you want summary info for `iris[list_of_cols].describe()` or `iris['sepal_length'].describe()` will give you stats just for that column – EdChum May 13 '16 at 12:22
  • If you're only after specific stats then you index them `iris['sepal_width'].describe().loc[['min','max']]` – EdChum May 13 '16 at 12:28
  • If you only want to do those calculations, you can also just do `pd.Series(dict(max_width=iris.sepal_width.max(), min_width=iris.sepal_width.min()))` to get almost the same output as the dplyr one. – joris May 13 '16 at 12:36
  • @EdChum Ok, but how would this approach generalise if I had more than one column and more than one summary statistic? For example, if I wanted the output to contain: `max` of Sepal.Width, `min` of Petal.Length and `median` of `Petal.Width`, say? dplyr::summarise would handle that easily. – jaweej May 13 '16 at 12:46
  • There's no built-in function. You'd have to filter the resultant df after calling `describe` or, as joris has stated, construct it yourself: `iris['sepal_width'].max()`, `iris['petal_length'].min()`, `iris['petal_width'].median()` – EdChum May 13 '16 at 12:52
  • @joris Your proposal seems closest. Could you make that an answer? – jaweej May 13 '16 at 13:20
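
The approach from joris' comment can be extended to several columns and statistics. The following is a minimal sketch, not a built-in equivalent of `summarise`: the small `iris` frame is a made-up stand-in for the real dataset, and the result labels (`max_sepal_width` and so on) are names chosen here for illustration.

```python
import pandas as pd

# Stand-in for the iris data; these values are invented for the example.
iris = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 4.4, 2.0],
    "petal_length": [1.4, 1.0, 6.9, 4.5],
    "petal_width": [0.2, 0.1, 2.5, 1.3],
})

# One labelled scalar per requested statistic, as in joris' comment.
summary = pd.Series({
    "max_sepal_width": iris["sepal_width"].max(),
    "min_petal_length": iris["petal_length"].min(),
    "median_petal_width": iris["petal_width"].median(),
})

# Transpose into a single-row frame, similar in shape to dplyr's output.
summary_row = summary.to_frame().T
print(summary_row)
```

Each entry is computed independently, so any mix of columns and statistics works, at the cost of spelling out every call by hand.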

2 Answers


As of pandas version 0.20, agg can be called on DataFrames too (source).

So you can do things like:

iris.agg({'sepal_width': 'min', 'petal_width': 'max'})

petal_width    2.5
sepal_width    2.0
dtype: float64

iris.agg({'sepal_width': ['min', 'median'], 'sepal_length': ['min', 'mean']})

        sepal_length  sepal_width
mean        5.843333          NaN
median           NaN          3.0
min         4.300000          2.0
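
To get closer to the single-row shape of the dplyr output in the question, one option is to aggregate a single column and relabel the result. This is only a sketch: the four-row `iris` frame is a made-up stand-in for the real data, and `max_width` / `min_width` are simply the labels used in the question.

```python
import pandas as pd

# Stand-in for the iris data; these values are invented for the example.
iris = pd.DataFrame({"sepal_width": [3.5, 3.0, 4.4, 2.0]})

# Series.agg with a list returns a Series indexed by the function names.
stats = iris["sepal_width"].agg(["max", "min"])

# Relabel the statistics, then transpose into a one-row DataFrame.
summary = (
    stats.rename({"max": "max_width", "min": "min_width"})
    .to_frame()
    .T
)
print(summary)
```

The renaming step is what stands in for the `name=expression` pairs that `summarise` accepts directly.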

Also see dplyr summarize equivalent in pandas. That one focuses on groupby operations though.

ayhan

To your question: Yes, there is.

>>> from datar.all import f, summarise, max, min
>>> from datar.datasets import iris
>>> 
>>> summarise(iris, max_width=max(f.Sepal_Width), min_width=min(f.Sepal_Width))
   max_width  min_width
   <float64>  <float64>
0        4.4        2.0

I am the author of the datar package.

Panwen Wang