I have a dataframe (df) and wish to obtain the largest counts of "NCT_ID" (not unique values only, but every occurrence) with respect to columns "COUNTRY" and "CONDITION". So that for each country in "COUNTRY", I will have the n (set n = 2 for simplicity) most common conditions in "CONDITION", sorted by largest. The df has the following structure (All columns vary in values, including "COUNTRY", this is just a small subset):
NCT_ID CONDITION COUNTRY
0 NCT00000261 Substance-Related Disorders United States
1 NCT00000262 Opioid-Related Disorders United States
2 NCT00000263 Substance-Related Disorders United States
3 NCT00000263 Substance-Related Disorders United States
4 NCT00000264 Heart disease Canada
5 NCT00000264 Heart disease Canada
6 NCT00000267 Heart disease Canada
7 NCT00000264 Cancer Canada
8 NCT00000268 Cancer Canada
Which you can load as follows:
import pandas as pd
df = pd.DataFrame([["NCT00000261", "Substance-Related Disorders", "United States"],
["NCT00000262", "Opioid-Related Disorders", "United States"],
["NCT00000263", "Substance-Related Disorders", "United States"],
["NCT00000263", "Substance-Related Disorders", "United States"],
["NCT00000264", "Heart disease", "Canada"],
["NCT00000264", "Heart disease", "Canada"],
["NCT00000267", "Heart disease", "Canada"],
["NCT00000264", "Cancer", "Canada"],
["NCT00000268", "Cancer", "Canada"]
],
columns=["NCT_ID", "CONDITION", "COUNTRY"]
)
So I am hoping for an end result that will look something like the following:
COUNTS CONDITION COUNTRY
0 3 Substance-Related Disorders United States
0 1 Opioid-Related Disorders United States
1 3 Heart disease Canada
1 2 Cancer Canada
This final df should show the n most common conditions, in the n countries with largest overall counts (of count of conditions combined). What I have done so far: Following https://stackoverflow.com/a/17679517/7445528, I have experimented with:
# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size()
# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size().groupby(level=0).max()
# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size().reset_index().groupby('COUNTRY')[[0]].max()
But this does not get the right dataframe result. To see the entire project so far: https://github.com/Gustav-Rasmussen/AACT-Analysis/tree/master