0

I am trying to figure out how to use the Pandas to find the number of unique values in a DataFrame that has been grouped by two of its columns. My sample dataset looks like this:

df = pd.DataFrame(
{"Header1" : [0,1,2,3,0,1,2,3], 
 "Header2" : [0,1,0,1,0,1,0,1],
 "values" : [1,2,3,4,1,3,2,1]}
)

I would love to be able to transform this to return a DataFrame that looks like this

output = pd.DataFrame(
{"Header1" : [0,1,2,3], 
"Header2" : [0,1,0,1],
"unique values" : [1,2,2,2]}
)

So far what I have tried is using groupby and nunique:

pd_series = df.groupby(['Header1', 'Header2'])['values'].nunique()

This returns the right answer but in a multi-indexed series data format that is very tricky to convert to a usable DataFrame. I've spent quite a lot of time trying to figure out how to correctly format the output with no luck. Instead of generating a DataFrame with the correct set of columns, pd_series.to_frame() produces a DataFrame with a single column named "values" with one row that contains the entire series object.

So far I am resorting to copy-pasting the results from nunique() into a new DataFrame. There must be a better way. Does anyone have any suggestions for how to do better with this?

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • I think you just need to set `as_index=False`, i.e. `df.groupby(['Header1', 'Header2'], as_index=False)['values'].nunique()`, no? – fsimonjetz May 19 '22 at 22:02
  • If you want to handle the rename as well you can set the `name` attribute of [Series.reset_index](https://pandas.pydata.org/docs/reference/api/pandas.Series.reset_index.html) as in [this answer](https://stackoverflow.com/a/32307259/15497888) or [this answer](https://stackoverflow.com/a/36953019/15497888) `new_df = df.groupby(['Header1', 'Header2'])['values'].nunique().reset_index(name="unique values")` – Henry Ecker May 19 '22 at 22:26

1 Answers1

0

Removing the bracket notation select of 'values' and adding .reset_index() to the end of your command will give your desired result in dataframe format.

df.groupby(['Header1', 'Header2']).nunique().reset_index()

cammyalex
  • 33
  • 4