I have two dataframes: tr is a training-set, ts is a test-set. They contain columns uid (a user_id), categ (a categorical), and response. response is the dependent variable I'm trying to predict in ts.

I am trying to compute the mean of response in tr, broken out by columns uid and categ:

avg_response_uid_categ = tr.groupby(['uid','categ']).response.mean()

This gives the right values, but the result is (unwantedly) indexed by a MultiIndex. This is the default groupby(..., as_index=True) behavior:

MultiIndex[--5hzxWLz5ozIg6OMo6tpQ  SomeValueOfCateg, --65q1FpAL_UQtVZ2PTGew  AnotherValueofCateg, ...

Instead, I want the result to keep 'uid' and 'categ' as two separate, ordinary columns.

Should I use aggregate() instead of groupby()? Trying groupby(as_index=False) did not help.

smci
  • 32,567
  • 20
  • 113
  • 146

1 Answer

The result seems to differ depending on whether you do:

tr.groupby(['uid','categ']).response.mean()

or:

tr.groupby(['uid','categ'])['response'].mean()  # RIGHT 

i.e. whether you select a single Series, or a DataFrame containing a single column. Related: Pandas selecting by label sometimes return Series, sometimes returns DataFrame
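For reference, here is a minimal sketch of the options on a toy frame. The data is invented for illustration, and the behavior described is what I would expect from recent pandas releases; older versions may differ. There, attribute access (.response) and single brackets (['response']) both select a Series-based groupby, while double brackets ([['response']]) select a DataFrame-based one. In either case the group keys end up in the index unless you call reset_index() or pass as_index=False:

```python
import pandas as pd

# Toy stand-in for the training set `tr`; the values are made up for illustration.
tr = pd.DataFrame({
    "uid": ["u1", "u1", "u2", "u2"],
    "categ": ["a", "a", "a", "b"],
    "response": [1.0, 3.0, 2.0, 4.0],
})

# Single brackets (like attribute access) give a Series
# indexed by a (uid, categ) MultiIndex.
as_series = tr.groupby(["uid", "categ"])["response"].mean()

# Option 1: move the index levels back into ordinary columns.
flat_reset = as_series.reset_index()

# Option 2: ask groupby not to turn the keys into an index at all.
flat_no_index = tr.groupby(["uid", "categ"], as_index=False)["response"].mean()

# Double brackets give a DataFrame, but the keys are still
# a MultiIndex until reset_index() is called.
as_frame = tr.groupby(["uid", "categ"])[["response"]].mean()

print(flat_reset)
```

Both flat_reset and flat_no_index keep 'uid' and 'categ' as separate columns next to 'response', which is the shape asked for in the question.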

smci