
Suppose I have a data frame in which 3 of the columns contain parameters, so the possible cohorts are all combinations of their values: say one column holds "duck", "cat", "dog", another "white", "grey", "black", and the last one "small", "big". Cohorts are then "small grey duck", "big black cat", and so on.

How can I group the data frame by all possible cohorts stored therein? I need this to calculate certain values for each cohort. Thank you.

Chinez
Versteher
  • `DataFrame` has `groupby()`. Is `data_frame.groupby(column_list).(...)`, e.g. `data_frame.groupby(["animal", "colour", "size"]).max("age")`, not doing what you need? – valid Dec 19 '20 at 14:55

1 Answer


Somehow you have to split your columns into their unique elements (whether you store those as lists, sets or `pandas.Series`).

For example:

import pandas as pd  # assuming df is your original data frame

cols1 = set(df['col1'])
cols2 = set(df['col2'])
cols3 = set(df['col3'])
# Cartesian product of the unique values of the three columns
df = pd.DataFrame([[x, y, z] for x in cols1 for y in cols2 for z in cols3],
                  columns=['col1', 'col2', 'col3'])
# The product of sets is already unique, so this is only a safeguard:
df.drop_duplicates(keep="first", inplace=True)

You could also do this with a loop (looping over df.columns.tolist()) and storing each column's unique values in a dict.

(This helps you gather all the possible combinations of cohorts.)
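The loop-plus-dict idea can be sketched with `itertools.product` as a standard-library alternative to the nested comprehension; the sample data frame and column names below are made up for illustration:

```python
import itertools

import pandas as pd

# Made-up sample data; the column names are assumptions.
df = pd.DataFrame(
    [["duck", "white", "small"], ["cat", "black", "big"], ["dog", "grey", "small"]],
    columns=["col1", "col2", "col3"],
)

# Store each column's unique values in a dict...
uniques = {col: df[col].unique() for col in df.columns.tolist()}

# ...then build every combination with itertools.product.
combos = pd.DataFrame(
    list(itertools.product(*uniques.values())), columns=list(uniques.keys())
)
print(len(combos))  # 3 animals * 3 colours * 2 sizes = 18
```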

EDIT:

To group the combinations in your first dataframe, you could do something like this:

df.groupby(['col1', 'col2', 'col3'])['col4'].mean()

See the documentation for pandas.DataFrame.groupby.
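For instance, with a small made-up frame (an `age` value column standing in for `col4` above), the groupby yields one aggregated value per cohort actually present in the data:

```python
import pandas as pd

# Made-up frame; 'age' plays the role of the value column ('col4' above).
df = pd.DataFrame(
    [
        ["duck", "white", "small", 1],
        ["duck", "white", "small", 3],
        ["cat", "black", "big", 5],
    ],
    columns=["col1", "col2", "col3", "age"],
)

# One aggregated value per cohort observed in the data.
means = df.groupby(["col1", "col2", "col3"])["age"].mean()
print(means.loc[("duck", "white", "small")])  # 2.0
```

Note that only the two cohorts present in the rows appear in the result; cohorts that never occur in the data produce no group.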

EDIT 2:

Another way to get all your combinations, using pandas only, could be:

from functools import reduce
import pandas as pd

cols = [df[[col]].drop_duplicates(keep='first') for col in ['col1', 'col2', 'col3']]
for sub_df_col in cols:
    sub_df_col['CARTESIAN_PRODUCT'] = 1  # every row now shares the same key value
df2 = reduce(
    lambda left, right: pd.merge(left, right, on=['CARTESIAN_PRODUCT'], how='outer'),
    cols,
).drop('CARTESIAN_PRODUCT', axis=1)

Credit to @everestial007 for the reduce/lambda solution.
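On recent pandas versions (1.2 and later), the dummy-key trick can be replaced by the built-in cross join (`how="cross"`); a minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data; two unique values per column -> 2 * 2 * 2 combinations.
df = pd.DataFrame(
    [["duck", "white", "small"], ["cat", "black", "big"]],
    columns=["col1", "col2", "col3"],
)

# One single-column frame of unique values per column, as above...
cols = [df[[col]].drop_duplicates(keep="first") for col in ["col1", "col2", "col3"]]

# ...then chain cross joins instead of merging on a dummy key.
combos = cols[0].merge(cols[1], how="cross").merge(cols[2], how="cross")
print(len(combos))  # 8
```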

tgrandje
  • Thank you. So I am supposed to build a new data frame holding the unique values from the columns. But a) will I be able to find all combinations of cohorts, since I need to know all possible combinations of values from all 3 columns? My cohorts need to be composite; b) will I be able to calculate anything in my new data frame? I will need to calculate data in the original data frame. – Versteher Dec 19 '20 at 14:50
  • This answer does not yet contain the grouping that OP asked for. – valid Dec 19 '20 at 14:59
  • @valid: can't figure out what is being asked of this grouping operation without any sample though... But you're right – tgrandje Dec 19 '20 at 15:00
  • @Versteher: just edited it, but I'm starting to suspect I did not understand your question at first... If groupby is the answer you were looking for, let me know and I will delete this post. – tgrandje Dec 19 '20 at 15:06
  • @Versteher When you say "all possible combinations/cohorts", do you mean "the theoretical extent/powerset" or do you mean "all combinations that can be observed in the actual dataframe"? – valid Dec 19 '20 at 15:06
  • groupby cannot be a solution, in my understanding of it. The core word, for which I can't find an answer in cohort analysis resources on the net, is "composite". Suppose I have three columns: country, gender and OS. Then all possible cohorts are: "US male Windows", "France female Linux" etc. Afterwards I will need to calculate certain values for all of the cohorts. Thus the question has three steps: – Versteher Dec 22 '20 at 09:15
  • 1) what is the combinatorics formula to calculate all cohorts? If there were 1 set, it would be 3!/(set - 3)!, but we have three separate sets (countries, gender, OS); that's why I call the cohorts composite. 2) how can I review all (composite) cohorts that actually occur in the dataset? 3) how will I then calculate values for each cohort? – Versteher Dec 22 '20 at 09:19
  • exempli gratia, `list = [['US', 'male', 'Linux'], ['UK', 'female', 'Windows'], ['UAE', 'male', 'Android'], ['KSA', 'male', 'ios'], ['USA', 'female', 'Windows']]; df = pd.DataFrame(list, columns=['country', 'gender', 'OS'])` — this DF has composite cohorts: each cohort consists of 3 elements – Versteher Dec 22 '20 at 09:31
  • That's exactly what my first answer was: recreating a new dataframe with all combinations. If you want to compute any calculation afterwards, you may merge the new dataframe with the old one (being careful to use different names for your variables, for example `df_new = df_combinations.merge(df_init, on=['col1', 'col2', 'col3'], how="left")`). Though any "groupby" applied to the final dataframe won't give you any result for combinations not present in the data at first... – tgrandje Dec 22 '20 at 19:16
  • in fact, from the point of view of maths, a solution for finding ALL possible combinations does indeed exist: if we have m values of one cohort column and n values of another, then the number of all their possible combinations is m*n. This m*n has to be distributed over the number of entries in the table, so the number of all possible composite cohorts is (the product of the value counts of each column) raised to the power of the number of entries. – Versteher Jan 06 '21 at 11:13
  • But this does not help much in practice, since we usually want to know not all possible combinations, but only those present in the database. I suppose a pandas method should exist that would run through all columns, find all composite cohorts (combinations of values present in the database) and report which of them occur in the database – Versteher Jan 06 '21 at 11:14
  • I disagree: pandas is just perfect the way it is; it's meant to work on matrices, not to shuffle each column independently. If you want to find every combination, you have to work on each column on its own (using it as a vector, a list or a set, whatever). This is what I did in my first example using sets and list comprehension. I'm pretty sure this is efficient enough, but I'll add another way using only vectors; it may be easier for you to understand – tgrandje Jan 06 '21 at 16:40
  • Just added the solution working with pandas only. It will probably be faster on big datasets, though I'm not sure by how much... – tgrandje Jan 06 '21 at 16:53