Substitute for mutate (dplyr package) in python pandas

Question

Is there a Python pandas function similar to R's dplyr::mutate(), which can add a new column to grouped data by applying a function on one of the columns of the grouped data? Below is the detailed explanation of the problem:

I generated sample data using this code:

x <- data.frame(country = rep(c("US", "UK"), 5), state = c(letters[1:10]), pop=sample(10000:50000,10))

Now I want to add a new column that holds the maximum population for each country (US and UK). I can do it using the following R code...

x <- group_by(x, country)
x <- mutate(x,max_pop = max(pop))
x <- arrange(x, country)

...or equivalently, using the R dplyr pipe operator:

x %>% group_by(country) %>% mutate(max_pop = max(pop)) %>% arrange(country)

So my question is: how do I do this in Python using pandas? I tried the following, but it did not work:

x['max_pop'] = x.groupby('country').pop.apply(max)

No piping? One of dplyr's signature methods: `x %>% group_by(country) %>% mutate(max_pop = max(pop)) %>% arrange(country)`...somewhere an R programmer is crying a little! — Parfait, Dec 14 '16 at 20:20
I understand. You will in time. At first, I hated R's apply family. Just leave me my `for` and `while` loops. They were so hard to understand or write. Now I love lapply, mapply, vapply, sapply -methods Python's pandas lacks (without custom workarounds). — Parfait, Dec 15 '16 at 14:12
But apply functions provide some serious performance advantage over `for` and `while` loops. They are much faster. I am not sure if that is the case with piping. **Please let me know if piping is faster than the conventional method**. — saurav shekhar, Dec 15 '16 at 17:23
That's actually a misnomer. Apply functions are just loops underneath, i.e., [syntactic sugar](http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar). They are not always advantageous over traditional looping. I like them because they return a list/vector/matrix of equal length to input as other loops do not necessarily return objects. — Parfait, Dec 15 '16 at 17:48
As for piping, it may not be faster but some argue it makes code more compact and you avoid new variables or reassigning old variables. — Parfait, Dec 15 '16 at 17:49

piRSquared · Accepted Answer · 2018-03-01T23:03:39.737

9

You want to use transform. It returns an object with the same index as the object being grouped, which makes it easy to assign the result back as a new column of the original DataFrame.

x['max_pop'] = x.groupby('country')['pop'].transform('max')
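
To illustrate the index point, here is a minimal sketch on a small copy of the sample data. It assumes the attempt in the question failed because an aggregation returns one value per group, indexed by country, so the values do not line up with the row index on assignment. The names per_group, per_row and misaligned are made up for this illustration, and the column is selected with ['pop'] because DataFrame already has a method called pop.

```python
import pandas as pd

x = pd.DataFrame({
    "country": ["US", "UK", "US", "UK"],
    "state": ["a", "b", "c", "d"],
    "pop": [37088, 46987, 17116, 20484],
})

# An aggregation collapses each group to one value, indexed by the group key.
per_group = x.groupby("country")["pop"].max()

# Assigning it as a column aligns on index labels; 'UK'/'US' never match
# the row labels 0..3, so every value in the new column ends up missing.
misaligned = x.assign(max_pop=per_group)

# transform broadcasts each group's result back onto the original rows,
# so the result carries the same index as x.
per_row = x.groupby("country")["pop"].transform("max")

x["max_pop"] = per_row
print(x)
```

The transform result has one value per original row, which is what mutate() produces on grouped data in dplyr.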

Setup

import pandas as pd 

x = pd.DataFrame(dict(
    country=['US','UK','US','UK'],
    state=['a','b','c','d'],
    pop=[37088, 46987, 17116, 20484]
))
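
For readers who want something close to the full dplyr pipeline from the question (group_by, mutate, arrange) in one expression, a possible sketch using method chaining is shown below. This is one way to write it, not the only one. The variable name result, the lambda argument d, and the choice of mergesort to keep the original row order within each country are assumptions made for this example.

```python
import pandas as pd

x = pd.DataFrame({
    "country": ["US", "UK", "US", "UK"],
    "state": ["a", "b", "c", "d"],
    "pop": [37088, 46987, 17116, 20484],
})

result = (
    x
    # mutate(max_pop = max(pop)) on data grouped by country
    .assign(max_pop=lambda d: d.groupby("country")["pop"].transform("max"))
    # arrange(country); mergesort is a stable sort, so ties keep their order
    .sort_values("country", kind="mergesort")
    # renumber the rows 0..n-1 after sorting
    .reset_index(drop=True)
)
print(result)
```

Because assign returns a new DataFrame, x itself is left unchanged, which is similar to how the piped R version behaves without reassignment.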

edited Mar 01 '18 at 23:03

answered Dec 14 '16 at 16:50


Panwen Wang · Answer 2 · 2021-06-17T18:03:33.593

I have been porting data packages (dplyr, tidyr, tibble, etc.) from R to Python:

https://github.com/pwwang/datar

If you are familiar with those packages in R and want to use the same API in Python, this is for you:

>>> from datar.all import (
...     c, f, tibble, rep, letters, sample, group_by, mutate, arrange, max
... )
>>> 
>>> x = tibble(
...   country=rep(c("US", "UK"), 5), 
...   state=c(letters[:10]), 
...   pop=sample(f[10000:50000], 10)
... )
>>> 
>>> x >> group_by(f.country) >> mutate(max_pop=max(f.pop)) >> arrange(f.country)
   country    state     pop  max_pop
  <object> <object> <int64>  <int64>
0       UK        b   48496    49290
1       UK        d   49290    49290
2       UK        f   46748    49290
3       UK        h   43078    49290
4       UK        j   20552    49290
5       US        a   29046    45070
6       US        c   22936    45070
7       US        e   44238    45070
8       US        g   12995    45070
9       US        i   45070    45070

[Groups: country (n=2)]
