Is there a pandas function similar to R's dplyr::mutate() that can add a new column to grouped data by applying a function to one of its columns? The problem in detail:

I generated sample data using this code:

x <- data.frame(country = rep(c("US", "UK"), 5), state = c(letters[1:10]), pop=sample(10000:50000,10))

Now I want to add a new column holding the maximum population within each country (US and UK). I can do this with the following R code...

x <- group_by(x, country)
x <- mutate(x,max_pop = max(pop))
x <- arrange(x, country)

...or equivalently, using the R dplyr pipe operator:

x %>% group_by(country) %>% mutate(max_pop = max(pop)) %>% arrange(country)

So my question is: how do I do this in Python using pandas? I tried the following, but it did not work:

x['max_pop'] = x.groupby('country').pop.apply(max)
smci
saurav shekhar
  • No piping? It is one of dplyr's signature features: `x %>% group_by(country) %>% mutate(max_pop = max(pop)) %>% arrange(country)`... somewhere an R programmer is crying a little! – Parfait Dec 14 '16 at 20:20
  • Yeah, but I feel more comfortable without the pipe operator. – saurav shekhar Dec 14 '16 at 20:26
  • I understand. You will in time. At first I hated R's apply family and just wanted my `for` and `while` loops; the apply functions were so hard to understand or write. Now I love lapply, mapply, vapply, and sapply, methods that Python's pandas lacks (without custom workarounds). – Parfait Dec 15 '16 at 14:12
  • But apply functions provide a serious performance advantage over `for` and `while` loops; they are much faster. I am not sure whether that is the case with piping. **Please let me know if piping is faster than the conventional method.** – saurav shekhar Dec 15 '16 at 17:23
  • That's actually a misconception. Apply functions are just loops underneath, i.e., [syntactic sugar](http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar). They are not always advantageous over traditional looping. I like them because they return a list/vector/matrix of the same length as the input, whereas other loops do not necessarily return objects. – Parfait Dec 15 '16 at 17:48
  • As for piping, it may not be faster, but some argue it makes code more compact and avoids creating new variables or reassigning old ones. – Parfait Dec 15 '16 at 17:49
  • Thanks for clarifying – saurav shekhar Dec 15 '16 at 21:16

2 Answers

You want to use transform. It returns an object with the same index as the object being grouped, which makes it easy to assign the result back as a new column when that object is a DataFrame.

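# Bracket indexing selects the 'pop' column without relying on attribute
# access, which could be confused with the DataFrame.pop method.
x['max_pop'] = x.groupby('country')['pop'].transform('max')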
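
To see why the attempt from the question produces no usable column while transform does, it can help to compare the index of an aggregated result with that of a transformed one. The following is only a sketch on the same small frame used in the Setup section; the names `grouped`, `per_group`, `per_row`, and `misaligned` are illustrative and not part of the original answer.

```python
import pandas as pd

x = pd.DataFrame(dict(
    country=['US', 'UK', 'US', 'UK'],
    state=['a', 'b', 'c', 'd'],
    pop=[37088, 46987, 17116, 20484],
))

grouped = x.groupby('country')['pop']

# Aggregation: one value per group, indexed by the group key ('UK', 'US').
per_group = grouped.max()

# Transform: one value per original row, indexed like x (0..3).
per_row = grouped.transform('max')

# Column assignment aligns on index labels. 'UK'/'US' do not match the
# row labels 0..3, so the aggregated result yields only missing values.
misaligned = x.assign(max_pop=per_group)

# The transformed result shares x's index, so it lines up row by row.
x['max_pop'] = per_row
print(x)
```

In dplyr terms, the aggregated form behaves more like `summarise()`, while transform behaves like `mutate()` on grouped data.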

Setup

import pandas as pd 

x = pd.DataFrame(dict(
    country=['US','UK','US','UK'],
    state=['a','b','c','d'],
    pop=[37088, 46987, 17116, 20484]
))
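
For readers used to the dplyr pipe, the whole `group_by` / `mutate` / `arrange` chain from the question can be approximated with pandas method chaining. This is a sketch under assumptions rather than the only way to write it: here `assign` stands in for `mutate` and `sort_values` for `arrange`, and the name `result` is illustrative.

```python
import pandas as pd

x = pd.DataFrame(dict(
    country=['US', 'UK', 'US', 'UK'],
    state=['a', 'b', 'c', 'd'],
    pop=[37088, 46987, 17116, 20484],
))

result = (
    x
    # mutate(): add the per-country maximum as a new column
    .assign(max_pop=lambda d: d.groupby('country')['pop'].transform('max'))
    # arrange(): order rows by country, keeping the original order within ties
    .sort_values('country', kind='stable')
    # renumber the rows 0..n-1 after sorting
    .reset_index(drop=True)
)
print(result)
```

Unlike the in-place column assignment, this chain leaves `x` untouched and returns a new DataFrame, which is closer to how the R pipe behaves.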
piRSquared

I have been porting data packages (dplyr, tidyr, tibble, etc.) from R to Python:

https://github.com/pwwang/datar

If you are familiar with those packages in R and want to use them in Python, this port is for you:

>>> from datar.all import (
...     c, f, tibble, rep, letters, sample, group_by, mutate, arrange, max
... )
>>> 
>>> x = tibble(
...   country=rep(c("US", "UK"), 5), 
...   state=c(letters[:10]), 
...   pop=sample(f[10000:50000], 10)
... )
>>> 
>>> x >> group_by(f.country) >> mutate(max_pop=max(f.pop)) >> arrange(f.country)
   country    state     pop  max_pop
  <object> <object> <int64>  <int64>
0       UK        b   48496    49290
1       UK        d   49290    49290
2       UK        f   46748    49290
3       UK        h   43078    49290
4       UK        j   20552    49290
5       US        a   29046    45070
6       US        c   22936    45070
7       US        e   44238    45070
8       US        g   12995    45070
9       US        i   45070    45070

[Groups: country (n=2)]
Panwen Wang