2

I have a pandas DataFrame like this:

n = 6000
my_data = DataFrame ({
    "Category"  : np.random.choice (['cat1','cat2'], size=n) ,
    "val_1"     : np.random.randn(n) ,
    "val_2"     : [i for i in range (1,n+1)]
})

I am aggregating on Category, and applying different functions to different columns, like so:

counts_and_means = \
    my_data.groupby("Category").agg (
        {
            "Category"  : np.count_nonzero ,
            "val_1"     : np.mean ,
            "val_2"     : np.mean
        }
    )

After this finishes, I want an explicit column ordering and new column names. I do that with reindex and rename, chaining them with the original aggregation in a fluent style, like so:

counts_and_means = \
    my_data.groupby("Category").agg (
        {
            "Category"  : np.count_nonzero ,
            "val_1"     : np.mean ,
            "val_2"     : np.mean
        }
    ) \
    .reindex (columns = ["Category","val_1","val_2"]) \
    .rename (
        columns = {
            "Category" : "Count" ,
            "val_1"    : "Avg. Val_1" ,
            "val_2"    : "Avg. Val_2" ,
        }
    )

Is this the best way (in terms of idiom, performance, etc.)? Or is there a way to explicitly specify the column names and ordering right in the agg(...) step?

I am asking because I am new to the idioms of this API and want to get them right, and because it looks like reindex and rename both create DataFrame copies, which could be a bigger issue with large data sets (I am aware of the inplace parameter for rename, but that wouldn't work in my fluent setup). Any help/advice is greatly appreciated.

sparc_spread
  • 10,643
  • 11
  • 45
  • 59

1 Answers1

2

It's worth noting in Python 3.3+ the order of the dict is not garaunteed (and it's not even going to be the same on each call):

In [11]: counts_and_means = \
    my_data.groupby("Category").agg (
        {
            "Category"  : np.count_nonzero ,
            "val_1"     : np.mean ,
            "val_2"     : np.mean
        }
    )

In [12]: counts_and_means
Out[12]:
                val_2  Category     val_1
Category
cat1      2972.181788      3009  0.005821
cat2      3028.988633      2991  0.027436

One way around this is to use OrderedDict:

In [13]: from collections import OrderedDict

In [14]: counts_and_means = \
    my_data.groupby("Category").agg(
        OrderedDict([
            ("Category", np.count_nonzero),
            ("val_1",    np.mean),
            ("val_2",    np.mean)
        ])
    )

In [15]: counts_and_means
Out[15]:
          Category     val_1        val_2
Category
cat1          3009  0.005821  2972.181788
cat2          2991  0.027436  3028.988633

Now you can rename directly with the .columns attribute:

In [16]: counts_and_means.columns = ["Count", "Avg_val1", "Avg_val2"]

Note: IMO using .s in column names is not pandaroble, since you can no longer access the columns as DataFrame attributes. Try and keep these are identifiers where possible.


An alternative, which I would say may be more idiomatic* and somewhat less verbose, is to build this iteratively:

In [21]: g = my_data.groupby("Category")

In [22]: counts_and_means = g["Category"].agg(np.count_nonzero).to_frame(name="Count")
         counts_and_means["Avg_val1"] = g["val_1"].agg("mean")
         counts_and_means["Avg_val2"] = g["val_2"].agg("mean")

In [23]: counts_and_means
Out[23]:
          Count  Avg_val1     Avg_val2
Category
cat1       3009  0.005821  2972.181788
cat2       2991  0.027436  3028.988633

*You don't need to do everything in one line!! :)

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • FYI the reason I am trying to do all in one line is I'm trying to avoid the unnecessary creation of temporary datasets, which for large datasets could (I guess?) become a performance issue. Am I overthinking this / prematurely optimizing? Anyway, your solution is good because getting the column order right at creation means just one creation. The renaming can be done in place, as you've shown. One last question, speaking of that: what is the difference between `counts_and_means.columns = ....` and `counts_and_means.rename (...... , inplace=True)`? – sparc_spread Apr 22 '15 at 16:54
  • 1
    "Am I overthinking this / prematurely optimizing?" Yes. The implementation of DataFrames is very efficient for this, and doing the agg internally is creating copies just like this. You shouldn't worry *unless* it's a (performance) problem. `.columns` is more efficient as it just replaces one list with another, rename has to lookup each entry and replace it (useful sometimes). Be warned, using the inplace flag often creates a copy under the hood. – Andy Hayden Apr 22 '15 at 17:56
  • 1
    Generally the best way to determine which way is more efficient is to %timeit (and see the difference) for a few different example data sets / sizes. Accept no substitute. – Andy Hayden Apr 22 '15 at 17:57
  • From Python 3.6 (formally 3.7) dicts are insertion-ordered. However, I've not edited the answer since it would all need to be rewritten to take those changes into account. Short version: OrderedDict not needed, use dict. :) – creanion May 14 '22 at 08:43