I have a dataframe.

df = ...
gb = df.groupby(['A','B','C'],axis=0)

new_df = gb.agg(['sum','max']) 


from typing import List

def func(expected_list: List[str]):
    ...
    # fails if a list of strings is not passed

func(list(new_df.columns))

And things fail.

new_df.columns yields: [ ('A','max'), ('A','sum'), ('B','max'), ... ]

How does one convert the result of an aggregate groupby call into a dataframe whose columns attribute returns a flat Index of strings, as expected, rather than this collection of tuples?

Chris
  • Do something like `new_df.columns = ['_'.join(x) for x in new_df.columns]`. That's a simple way to collapse a MultiIndex. Probably close as a duplicate of: https://stackoverflow.com/questions/14507794/pandas-how-to-flatten-a-hierarchical-index-in-columns? – ALollz Mar 27 '20 at 20:38
  • What do you mean incorrect? You haven't provided any expected output and that will flatten the MultiIndex given those tuples... And your function is a complete black box. You say it needs strings, but who knows the myriad of ways it could fail. – ALollz Mar 27 '20 at 20:51
  • No, it's really not. The columns attribute is an `Immutable ndarray implementing an ordered, sliceable set`. Iteration over an ordered object **preserves** order, so iterating and assigning back **is safe**. It's **immutable**, so you can't change it without re-creating it. I mean, feel free to use `df.set_axis(that_list, axis=1)`, but it's just more verbose. – ALollz Mar 27 '20 at 20:57
  • The behavior is clear; if you agg with a list of functions it returns a MultiIndex. If you agg with a single function it returns a normal Index. You can/should either write `func` to properly handle either an Index or a MultiIndex, or you can use the myriad of other aggregate tools, like [NamedAggregations](https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html#named-aggregation) to explicitly force a simple Index. – ALollz Mar 27 '20 at 21:25
  • `pandas` is a powerful tool, capable of efficiently transforming billions of rows of data with self-documenting syntax. That power does come at a cost: there's a large amount of optimization behind the scenes. Most of the functions are written in Cython, and the behavior you find in that example is because of that. SO is not the place to debate how pandas is designed; if you want, I suggest trying the Python chat room or GitHub. But if you want to learn about pandas, just ask new questions (like the above); many of the users of the pandas tag are very kind and knowledgeable. That's how I learned. – ALollz Mar 30 '20 at 21:24
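
The named-aggregation route mentioned in the comments can be sketched roughly as follows. The sample data and the column names `D`, `D_sum`, and `D_max` are assumptions made up for illustration; they do not come from the question. With keyword-style aggregation, each keyword becomes a column label, so the result should carry a flat Index of strings rather than a MultiIndex.

```python
import pandas as pd

# Hypothetical sample data; the question does not show the real frame.
df = pd.DataFrame({
    "A": ["x", "x", "y"],
    "B": ["p", "p", "q"],
    "C": ["m", "m", "n"],
    "D": [1, 2, 3],
})

# Named aggregation: keyword = (source column, aggregation function).
# Each keyword becomes the label of one output column.
named = df.groupby(["A", "B", "C"]).agg(
    D_sum=("D", "sum"),
    D_max=("D", "max"),
)

# The column labels are plain strings, so they can be passed on as a list of str.
column_names = list(named.columns)
```

The trade-off is that every output column has to be spelled out by hand, which may be tedious for wide frames.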

1 Answer

So it turns out that when agg is given a list of functions, the returned columns are a different type of index known as a MultiIndex. In pseudo-code, the result of agg is not a DataFrame(Index) but a DataFrame(MultiIndex). You can get back a DataFrame(Index), which behaves like a DataFrame output by read_csv for all downstream code, by joining each column tuple into a single string:

import re
result.columns = [re.sub(r'_$', '', '_'.join(col)) for col in result.columns]
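
As a minimal end-to-end sketch of this flattening, assuming a small made-up frame (the value columns `D` and `E` are hypothetical, not from the question): the trailing-underscore cleanup with `re.sub` seems to matter mainly after `reset_index()`, where the group keys get an empty second level and would otherwise come out as `A_`, `B_`, `C_`.

```python
import re

import pandas as pd

# Hypothetical sample data; the question does not show the real frame.
df = pd.DataFrame({
    "A": ["x", "x", "y"],
    "B": ["p", "p", "q"],
    "C": ["m", "m", "n"],
    "D": [1, 2, 3],
    "E": [10.0, 20.0, 30.0],
})

# Aggregating with a list of functions yields MultiIndex columns,
# i.e. tuples such as ('D', 'sum') and ('D', 'max').
new_df = df.groupby(["A", "B", "C"]).agg(["sum", "max"])

# Move the group keys back into columns; their second level is the empty string.
result = new_df.reset_index()

# Join each tuple with '_' and strip the trailing '_' left by empty levels.
result.columns = [re.sub(r"_$", "", "_".join(col)) for col in result.columns]

flat_names = list(result.columns)
```

After this, `flat_names` is an ordinary list of strings, so a function that expects `List[str]` should accept it.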
Chris