21

This is more of a conceptual question; I do not have a specific problem.

I am learning Python for data analysis, but I am very familiar with R. One of the great things about R is plyr (and of course ggplot2), and dplyr is even better. Pandas has split-apply-combine as well, but in R I can do things like the following (in dplyr; it is a bit different in plyr, and I can now see how dplyr mimics the dot notation from object-oriented programming):

   data %.% group_by(c(.....)) %.% summarise(new1 = ...., new2 = ...., ..... newn=....)

which lets me create multiple summary calculations at the same time.

How do I do that in Python? Because

df[...].groupby(.....).sum() only sums columns,

while in R I can have one mean, one sum, one special function, etc. in a single call.

I realize I can do all my operations separately and merge them, and that is fine if I am using Python. But when it comes down to choosing a tool, every line of code you do not have to type, check, and validate adds up in time.

In addition, in dplyr you can also add mutate statements, so it seems to me it is far more powerful. So what am I missing about pandas or Python?
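(For reference: pandas can compute several different summaries in one groupby call via agg, and assign plays the role of mutate. A minimal sketch with made-up column names; the named-aggregation syntax needs a reasonably recent pandas.)

import pandas as pd

# Hypothetical data, just to illustrate the pattern
df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3], 'y': [4, 5, 6]})

out = (df.assign(z=df['x'] / df['y'])        # mutate-like derived column
         .groupby('g')
         .agg(new1=('x', 'mean'),            # one mean
              new2=('y', 'sum'),             # one sum
              newn=('z', lambda s: s.max() - s.min())))  # one custom function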

My goal is to learn. I have spent a lot of effort learning Python and it is a worthy investment, but the question still remains.

Blaszard
user1617979

7 Answers

29

I'm also a big fan of dplyr for R and am working to improve my knowledge of Pandas. Since you don't have a specific problem, I'd suggest checking out the post below that breaks down the entire introductory dplyr vignette and shows how all of it can be done with Pandas.

For example, the author demonstrates chaining with the pipe operator in R:

 flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
       ) %>%
   filter(arr > 30 | dep > 30)

And here is the Pandas implementation:

(flights.groupby(['year', 'month', 'day'])
    [['arr_delay', 'dep_delay']]
    .mean()
    .query('arr_delay > 30 | dep_delay > 30'))

There are many more comparisons of how to implement dplyr-like operations with pandas in the original post: http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0

lgallen
  • While this may answer the question, [it would be preferable](http://meta.stackoverflow.com/q/8259) to include the essential parts of the answer here (e.g. an example addressing the question's sample problem?), and provide the link for reference. – Nathan Tuggy Apr 12 '15 at 02:51
  • Answer edited to include information about chaining, which is necessary for the original poster's point about performing multiple summary calculations. – lgallen Apr 12 '15 at 03:46
17

One could simply use dplyr from Python.

There is an interface to dplyr in rpy2 (introduced with rpy2-2.7.0) that lets you write things like:

dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))

There is an example in the documentation. That part of the documentation is (also) a Jupyter notebook; look for the links near the top of the page.
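For context, the DataFrame used above comes from rpy2's dplyr wrapper module. A minimal setup sketch, assuming rpy2 >= 2.7.0 is installed and the dplyr package is available in R:

# Assumes rpy2 >= 2.7.0 plus an R installation with dplyr available.
from rpy2.robjects.packages import importr, data
from rpy2.robjects.lib.dplyr import DataFrame

datasets = importr('datasets')
mtcars = data(datasets).fetch('mtcars')['mtcars']  # R's mtcars as an R data.frame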

Another answer to the question compares R's dplyr and pandas (see @lgallen). The same R one-liner chaining dplyr statements is written essentially the same way in rpy2's interface to dplyr.

R:

flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
   filter(arr > 30 | dep > 30)

Python+rpy2:

(DataFrame(flights).
 group_by('year', 'month', 'day').
 select('arr_delay', 'dep_delay').
 summarize(arr = 'mean(arr_delay, na.rm=TRUE)',
           dep = 'mean(dep_delay, na.rm=TRUE)').
 filter('arr > 30 | dep > 30'))
lgautier
  • But then you aren't even using native pandas dataframes anymore right? – Allen Wang Oct 31 '16 at 15:20
  • @AllenWang There are tools to shuttle between in-memory R data frames and pandas data frames (converters in rpy2). Besides that, the notion of native pandas dataframes might be less strict than it seems when one is considering persistent data frames (HDF5, (remote) SQL table(s), feather), as they can be accessed from Python/pandas or R (therefore Python/rpy2) as data frames in their respective worlds. – lgautier Oct 31 '16 at 16:30
7

I think you're looking for the agg function, which is applied to groupby objects.

From the docs:

In [48]: grouped = df.groupby('A')

In [49]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[49]: 
          sum      mean       std
A                                
bar  0.443469  0.147823  0.301765
foo  2.529056  0.505811  0.96
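To apply a different summary to each column in one call (closer to dplyr's summarise), a dict can be passed to agg; newer pandas versions also support named aggregation, which picks the source column and names the result at the same time. A small sketch, assuming the docs frame above also has a numeric column D:

# Dict form: a different aggregation per column in a single call (sketch).
df.groupby('A').agg({'C': 'mean', 'D': 'sum'})

# Named-aggregation form (pandas >= 0.25): choose the column and the output name.
df.groupby('A').agg(c_mean=('C', 'mean'), d_sum=('D', 'sum'))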
szxk
  • That seems more of a comment than an answer. But on the other hand no reproducible example was offered, so .... perhaps. As an R user I was amused to see that the `agg`-function is not very much like the R aggregate function. – IRTFM Nov 12 '14 at 03:06
  • OP asked for a way to apply multiple aggregate functions at the same time. A short answer is still an answer. – szxk Nov 12 '14 at 03:08
  • On the cited page, it looked like `grouped.agg()` might have been needed. – IRTFM Nov 12 '14 at 03:11
  • Okay, will clean up answer with an example – szxk Nov 12 '14 at 03:12
  • Thanks for the answer; I do not need a real example, as it is true that I did not provide an actual problem. I clearly stated it was a conceptual question. – user1617979 Nov 12 '14 at 12:06
  • I read the agg documentation and I have a further question. In dplyr I can specify both a name for a new column and which column of the original it applies to. In agg it appears you can do either but not both (or an example is not provided to do that) - is there a way to do that? – user1617979 Nov 12 '14 at 12:21
  • Not sure if that's possible out of the box. Check out [this answer](http://stackoverflow.com/a/15322715/2907617) for a workaround. – szxk Nov 12 '14 at 14:05
  • This answer addresses the group_by() function of dplyr. The OP asked how to chain data manipulation functions (including mutate) into a single call. The actual answer is, 'you can't, while still remaining pythonic. It's distinctly an R thing'. This is a workaround for one part of the question. – TheProletariat Aug 10 '17 at 16:37
5

The closest way to get dplyr-style syntax in Python is with the dfply package. Here is an example.

R dplyr

library(nycflights13)
library(dplyr)

flights %>%
  filter(hour > 10) %>% # step 1
  mutate(speed = distance / (air_time * 60)) %>% # step 2
  group_by(origin) %>% # step 3a
  summarize(mean_speed = sprintf("%0.6f",mean(speed, na.rm = T))) %>% # step 3b
  arrange(desc(mean_speed)) # step 4

# A tibble: 3 x 2
  origin mean_speed
  <chr>  <chr>     
1 EWR    0.109777  
2 JFK    0.109427  
3 LGA    0.107362 

Python dfply

from dfply import *
import pandas as pd

flight_data = pd.read_csv('nycflights13.csv')

(flight_data >>
  mask(X.hour > 10) >> # step 1
  mutate(speed = X.distance / (X.air_time * 60)) >> # step 2
  group_by(X.origin) >> # step 3a
  summarize(mean_speed = X.speed.mean()) >> # step 3b
  arrange(X.mean_speed, ascending=False) # step 4
)


Out[1]: 
  origin  mean_speed
0    EWR    0.109777
1    JFK    0.109427
2    LGA    0.107362

Python Pandas

flight_data.loc[flight_data['hour'] > 10, 'speed'] = flight_data['distance'] / (flight_data['air_time'] * 60)
result = flight_data.groupby('origin', as_index=False)['speed'].mean()
result.sort_values('speed', ascending=False)

Out[2]: 
  origin     speed
0    EWR  0.109777
1    JFK  0.109427
2    LGA  0.107362
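For comparison, the same pipeline can also be written as a single pandas method chain (a sketch, using the same column names as above):

# Same four steps as one pandas chain (sketch).
(flight_data
    .query('hour > 10')                                            # step 1
    .assign(speed=lambda d: d['distance'] / (d['air_time'] * 60))  # step 2
    .groupby('origin', as_index=False)['speed'].mean()             # step 3
    .sort_values('speed', ascending=False))                        # step 4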

Note: For more information you can check the following link.

Rafael Díaz
  • Great tool! Would it be possible to give a reproducible example (without importing data from somewhere)? – agenis Feb 07 '19 at 14:15
  • One option is to install `ggplot` (in Python: `pip install ggplot`), which includes several datasets like `mtcars`; there are numerous examples of the use of dplyr with this dataset. – Rafael Díaz Feb 07 '19 at 17:25
1

For dplyr I use dfply, which has the same syntax except that the pipe operator is `>>` instead of dplyr's `%>%`. You can use plotnine as a ggplot2 equivalent. I am not sharing dfply code since it is already shared above, but you can check the link below for plotnine:

https://plotnine.readthedocs.io/en/stable/gallery.html
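A tiny plotnine sketch for flavour (assumes `pip install plotnine`; the package ships with a copy of the mtcars dataset):

# Minimal plotnine example (sketch).
from plotnine import ggplot, aes, geom_point
from plotnine.data import mtcars

(ggplot(mtcars, aes('wt', 'mpg', color='factor(cyl)'))
 + geom_point())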

Rishabh
1

There is now a close port of dplyr and other related packages from R to Python:

https://github.com/pwwang/datar

Disclaimer: I am the author of the package.

Panwen Wang
0

One more example of group-by aggregation in R's dplyr and Python's pandas, using the iris dataset: group by Species and summarise the max, mean, median, and min of different columns:

library(tidyverse)

iris %>% group_by(Species) %>% 
  summarise(max(Sepal.Length),mean(Sepal.Width),median(Petal.Width),min(Petal.Length))

# A tibble: 3 x 5
  Species    `max(Sepal.Length)` `mean(Sepal.Width)` `median(Petal.Width)` `min(Petal.Length)`
  <fct>                    <dbl>               <dbl>                 <dbl>               <dbl>
1 setosa                     5.8                3.43                   0.2                 1  
2 versicolor                 7                  2.77                   1.3                 3  
3 virginica                  7.9                2.97                   2                   4.5

write_csv(iris, "iris.csv")

The same thing with pandas:

import pandas as pd
import numpy as np
df = pd.read_csv("iris.csv")

df_gb = pd.DataFrame()
df_gb['max Sepal.Length'] = df.groupby(['Species']).max()['Sepal.Length']
df_gb['mean Sepal.Width'] = df.groupby(['Species']).mean()['Sepal.Width']
df_gb['median Petal.Width'] = df.groupby(['Species']).median()['Petal.Width']
df_gb['min Petal.Length'] = df.groupby(['Species']).min()['Petal.Length']                        
df_gb                        

            max Sepal.Length  mean Sepal.Width  median Petal.Width  min Petal.Length
Species             
setosa                 5.8             3.428                    0.2              1.0
versicolor             7.0             2.770                    1.3              3.0
virginica              7.9             2.974                    2.0              4.5
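The four groupby calls can also be collapsed into a single agg call (a sketch; same column names as in iris.csv):

# One groupby with a per-column aggregation mapping (sketch).
df.groupby('Species').agg({'Sepal.Length': 'max',
                           'Sepal.Width': 'mean',
                           'Petal.Width': 'median',
                           'Petal.Length': 'min'})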
rubengavidia0x