0

I have a dataframe (df) and wish to obtain the largest counts of "NCT_ID" (not unique values only, but every occurrence) with respect to columns "COUNTRY" and "CONDITION". So that for each country in "COUNTRY", I will have the n (set n = 2 for simplicity) most common conditions in "CONDITION", sorted by largest. The df has the following structure (All columns vary in values, including "COUNTRY", this is just a small subset):

    NCT_ID      CONDITION                   COUNTRY
0   NCT00000261 Substance-Related Disorders United States
1   NCT00000262 Opioid-Related Disorders    United States
2   NCT00000263 Substance-Related Disorders United States
3   NCT00000263 Substance-Related Disorders United States
4   NCT00000264 Heart disease               Canada
5   NCT00000264 Heart disease               Canada
6   NCT00000267 Heart disease               Canada
7   NCT00000264 Cancer                      Canada
8   NCT00000268 Cancer                      Canada

Which you can load as follows:

import pandas as pd

df = pd.DataFrame([["NCT00000261", "Substance-Related Disorders", "United States"],
                   ["NCT00000262", "Opioid-Related Disorders", "United States"],
                   ["NCT00000263", "Substance-Related Disorders", "United States"],
                   ["NCT00000263", "Substance-Related Disorders", "United States"],
                   ["NCT00000264", "Heart disease", "Canada"],
                   ["NCT00000264", "Heart disease", "Canada"],
                   ["NCT00000267", "Heart disease", "Canada"],
                   ["NCT00000264", "Cancer", "Canada"],
                   ["NCT00000268", "Cancer", "Canada"]
                  ],
                  columns=["NCT_ID", "CONDITION", "COUNTRY"]
                 )

So I am hoping for an end result that will look something like the following:

    COUNTS  CONDITION                   COUNTRY
0   3       Substance-Related Disorders United States
0   1       Opioid-Related Disorders    United States
1   3       Heart disease               Canada
1   2       Cancer                      Canada

This final df should show the n most common conditions, in the n countries with largest overall counts (of count of conditions combined). What I have done so far: Following https://stackoverflow.com/a/17679517/7445528, I have experimented with:

# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size()
# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size().groupby(level=0).max()
# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size().reset_index().groupby('COUNTRY')[[0]].max()

But this does not get the right dataframe result. To see the entire project so far: https://github.com/Gustav-Rasmussen/AACT-Analysis/tree/master

Gustav Rasmussen
  • 3,720
  • 4
  • 23
  • 53
  • 2
    try this, ``df.groupby(['CONDITION', "COUNTRY"]).size().reset_index(name='COUNTS')`` – sushanth Jun 14 '20 at 10:47
  • Thank you @Sushanth , it gives me the count column, but it does not get the top countries first, or top conditions .. – Gustav Rasmussen Jun 14 '20 at 10:50
  • 1
    ``df.groupby(['CONDITION', "COUNTRY"]).size().reset_index(name='COUNTS').sort_values(by=[ 'COUNTRY', 'COUNTS'], ascending=False)`` yields what u need. – sushanth Jun 14 '20 at 10:59
  • Nice, @Sushanth , this is the correct dataframe structure I wish, just with the final feature lacking: I wish to take just the n largest condition frequencies for the n most represented countries (set n = 2 for simplicity) – Gustav Rasmussen Jun 14 '20 at 11:01
  • 1
    What will be the expected output for the above case. – sushanth Jun 14 '20 at 11:05
  • For the above subset, I wish for 2 entries for each country. In the entire dataframe however, there could be more entries that should not be included if their frequencies are not within the top n – Gustav Rasmussen Jun 14 '20 at 11:07

2 Answers2

1
new_df = df.groupby(['CONDITION', 'COUNTRY']).apply(len).reset_index(name='COUNTS')

new_df.sort_values(by='COUNTS', axis=0, inplace=True, ascending=False)
warped
  • 8,947
  • 3
  • 22
  • 49
  • Thank you @warped , this is really close to what I want. I am now getting the sorted condition-frequencies by country. The final step, I wish to take just the n largest condition frequencies for the n most represented countries (set n = 2 for simplicity) – Gustav Rasmussen Jun 14 '20 at 10:58
  • 1
    df.groupby('COUNTRY').head(2) worked after following your answer – Gustav Rasmussen Jun 14 '20 at 11:13
0

try this:

df.groupby(['CONDITION','COUNTRY']).count().rename(columns={'NCT_ID':'COUNT'}).reset_index().sort_values(by='COUNT', ascending=False)
````
Sandeep Kothari
  • 405
  • 3
  • 6