I am trying to figure out how to use pandas to count the unique values in a DataFrame that has been grouped by two of its columns. My sample dataset looks like this:
import pandas as pd

df = pd.DataFrame(
    {"Header1": [0, 1, 2, 3, 0, 1, 2, 3],
     "Header2": [0, 1, 0, 1, 0, 1, 0, 1],
     "values": [1, 2, 3, 4, 1, 3, 2, 1]}
)
I would like to transform this into a DataFrame that looks like this:
output = pd.DataFrame(
{"Header1" : [0,1,2,3],
"Header2" : [0,1,0,1],
"unique values" : [1,2,2,2]}
)
So far I have tried using groupby and nunique:
pd_series = df.groupby(['Header1', 'Header2'])['values'].nunique()
This returns the right answer, but as a multi-indexed Series that is very tricky to convert to a usable DataFrame. I've spent quite a lot of time trying to format the output correctly, with no luck. Instead of generating a DataFrame with the correct set of columns, pd_series.to_frame() produces a DataFrame with a single column named "values" with one row that contains the entire series object.
So far I am resorting to copy-pasting the results from nunique() into a new DataFrame. There must be a better way. Does anyone have any suggestions for how to do this better?
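For reference, here is a minimal sketch of one direction that might work, assuming that Series.reset_index with its name argument is an acceptable way to turn the index levels back into ordinary columns. The variable name result is only illustrative, and I have not checked this against every pandas version:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {"Header1": [0, 1, 2, 3, 0, 1, 2, 3],
     "Header2": [0, 1, 0, 1, 0, 1, 0, 1],
     "values": [1, 2, 3, 4, 1, 3, 2, 1]}
)

# Count distinct "values" per (Header1, Header2) group, then move the
# two index levels back into columns and name the count column.
result = (
    df.groupby(["Header1", "Header2"])["values"]
    .nunique()
    .reset_index(name="unique values")
)

print(result)
```

If this behaves as I expect, result would have the three flat columns Header1, Header2 and "unique values", matching the desired output above.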