1

I have dataframe df with columns: a, b, c,d. And i want to group data by a and make some calculations. I will provide code of this calculations in R. My main question is how to do the same in pandas?

library(dplyr)
df %>%
    group_by(a) %>%
    summarise(mean_b = mean(b),
              qt95 = quantile(b, .95),
              diff_b_c = max(b-c),
              std_b_d = sd(b)-sd(d)) %>% 
    ungroup()

This example is synthetic, I just want to understand pandas syntaxis

cs95
  • 379,657
  • 97
  • 704
  • 746
Slavka
  • 1,070
  • 4
  • 13
  • 28
  • 1
    This is not a valid question. Most pandas users don't know R, or even if they did, there seldom is a 1-1 mapping between R and pandas functions. The right way to ask this question would be to show your data (a few rows at least), provide some expected output, and an explanation on how to get there. – cs95 Dec 19 '17 at 11:42
  • 1
    You may check [here](https://stackoverflow.com/questions/43172970/python-pandas-groupby-aggregate-on-multiple-columns) – akrun Dec 19 '17 at 11:44
  • real data is absolutely unimportant. You can generate it via any random generators. I chosed very simple R functions like mean, sd, max, quantile. – Slavka Dec 19 '17 at 11:45
  • 2
    What do you mean, it's unimportant? If you want to understand pandas syntax, please go look at the documentation. If you have a specific programming question, that's where SO comes in. – cs95 Dec 19 '17 at 11:46
  • Okay, you can generate dataframe so: df = pd.DataFrame({ 'a':np.random.choice([1,2,3,4], 100), 'b': np.random.choice(range(100), 100), 'c': np.random.choice(range(100), 100), 'd': np.random.choice(range(100), 100) }) – Slavka Dec 19 '17 at 11:49
  • And what is it you want to do with `a`, `b`, `c`, and `d`? – cs95 Dec 19 '17 at 11:50
  • You should also edit your question with these details, that makes your question more valuable as a community resource. – cs95 Dec 19 '17 at 11:50

1 Answers1

6

I believe you need custom function with GroupBy.apply:

def f(x):
    mean_b = x.b.mean()
    qt95 = x.b.quantile(.95)
    diff_b_c = (x.b - x.c).max()
    std_b_d = x.b.std() - x.d.std()
    cols = ['mean_b','qt95','diff_b_c','std_b_d']
    return pd.Series([mean_b, qt95, diff_b_c, std_b_d], index=cols)

df1 = df.groupby('a').apply(f)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252