
I am attempting to split a DataFrame into groups. These groups must be as similar as possible based on the categorical variables.

For example, I have 10 marbles, and need to make 3 groups. 4 of my marbles are blue, 2 are yellow, 4 are white.

10 marbles will not divide evenly into 3 groups, so the group sizes will be 4, 3, 3, i.e. as close to even as possible.

Likewise, the colors will not have even representation between groups, since we only have 2 yellow marbles. However, those yellow marbles must be distributed across the groups as evenly as possible. This continues across all categorical variables in the data set.
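What I mean by "as close to even as possible" for the group sizes is just a floor/remainder split. A minimal sketch (the group_sizes helper is purely illustrative, not part of my code):

def group_sizes(n_items, n_groups):
    # 10 items into 3 groups -> base=3, rem=1 -> [4, 3, 3]
    base, rem = divmod(n_items, n_groups)
    return [base + 1 if i < rem else base for i in range(n_groups)]

print(group_sizes(10, 3))  # [4, 3, 3]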

My original plan was simply to check, for each row, whether its values were already present in another group, and if so try the next group. My co-worker pointed out a better way: generate groups, score them with one-hot encoding, and then swap rows until the one-hot sums converge to similar levels (indicating that each group contains a "close to representative" mix of the categorical variables). His solution is the posted answer.
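To make that scoring idea concrete, here is a minimal sketch of what "score with one-hot encoding" means (the group assignment is arbitrary, purely for illustration); my original attempt follows after it:

import pandas as pd

marbles = pd.DataFrame({'color': ['blue'] * 4 + ['yellow'] * 2 + ['white'] * 4,
                        'group': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]})  # arbitrary assignment
# one hot encode the categorical column, then sum within each group
dummies = pd.get_dummies(marbles['color'])
print(dummies.groupby(marbles['group']).sum())
# the closer these per-group sums are to each other, the more similar the groups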

import pandas as pd
import numpy as np
test = pd.DataFrame({'A': ['alice', 'bob', 'george', 'michael', 'john', 'peter', 'paul', 'mary'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat'],
                     'D': ['boy', 'girl', 'boy', 'girl', 'boy', 'girl', 'boy', 'girl']})
gr1, gr2, gr3 = [], [], []
gr1_names = []
def test_check1(x):

    #this is where I'm clearly not approaching this problem correctly
    for index, row in x.iterrows():
        if row['A'] not in gr1 and row['B'] not in gr1 and row['C'] not in gr1 and row['D'] not in gr1:
            gr1.extend(row)  # keep a record of what values are in this group
            gr1_names.append(row['A'])  # save the name

But having come here to ask, I also need to be able to say: "if the row wasn't allowed into ANY group, just toss it into the first one; the next time a row isn't allowed into any group, toss it into the second one", and so on.

I can see that my sample code does not adequately handle that situation.
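A rough sketch of that fallback idea (assuming the test frame above; cycling through the groups as the "dumping ground" is just one way to realize it):

import itertools

groups = [set(), set(), set()]        # categorical values already seen in each group
group_names = [[], [], []]            # names assigned to each group
fallback = itertools.cycle(range(3))  # rotate which group takes the leftovers

for _, row in test.iterrows():
    values = {row['B'], row['C'], row['D']}
    # prefer a group that has not yet seen any of this row's values, otherwise fall back
    target = next((i for i, g in enumerate(groups) if not values & g), next(fallback))
    groups[target] |= values
    group_names[target].append(row['A'])

print(group_names)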

I tried a random number generator and then making bins, and honestly that was pretty close, but I was hoping to find a non-random answer.

Here are some links that I found helpful as I worked on this today: How to get all possible combinations of a list's elements?

Get unique combinations of elements from a python list

Randomly reassign participants to groups such that participants originally from same group don't end up in same group (this one feels very close, but I can't figure out how to adapt it to what I need)

How to generate lists from a specification of element combinations

Expected output would be a dataframe in any shape, but a pivot of that dataframe would indicate:

group id    foo    bar    faz
       1      3      2      5
       2      3      2      5
       3      3      1      5
       4      4      1      5
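(For reference, a pivot like that can be produced with pd.crosstab once every row has a group label; foo, bar and faz above are just placeholder category levels.)

import pandas as pd

df = pd.DataFrame({'group': [1, 1, 2, 2, 3, 3],
                   'color': ['blue', 'white', 'blue', 'yellow', 'white', 'yellow']})
# count how many of each categorical level ended up in each group
print(pd.crosstab(df['group'], df['color']))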
  • Could you include the expected output? – MSeifert Aug 05 '17 at 11:59
  • Re stated the problem and included sample output. – Dylan Aug 08 '17 at 17:20
  • After having spent some time working and refining code and the problem, I have found a paper from 20 years ago that perfectly states the problem and suggests avenues for solving it. https://s3.amazonaws.com/academia.edu.documents/46548639/Creating_student_groups_with_similar_cha20160616-12316-ace99n.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1502743057&Signature=zKTfTCHEehucr2Oie%2FwgYmP8Ydg%3D&response-content-disposition=inline%3B%20filename%3DCreating_student_groups_with_similar_cha.pdf – Dylan Aug 14 '17 at 20:11

1 Answer

My co-worker has found a solution, and I think the solution also explains the problem better.

import pandas as pd
import random
import math
import itertools

def n_per_group(n, n_groups):
    """find the size of each group when splitting n people into n_groups"""
    base = n // n_groups
    rem = n % n_groups
    # the first `rem` groups get one extra person, e.g. 10 people into 3 groups -> [4, 3, 3]
    return [base + 1 if k < rem else base for k in range(n_groups)]

def assign_groups(n, n_groups):
    """split the n people into n_groups pretty evenly, and randomize"""
    n_per = n_per_group(n, n_groups)
    # build a list of group labels, each repeated by its group's size, then shuffle
    groups = list(itertools.chain(*[size * [label] for size, label in zip(n_per, range(n_groups))]))
    random.shuffle(groups)
    return groups

def group_diff(df, g1, g2):
    """calculate the between group score difference"""
    a = df.loc[df['group']==g1, ~df.columns.isin(('A','group'))].sum()
    b = df.loc[df['group']==g2, ~df.columns.isin(('A','group'))].sum()
    #print(a)
    return abs(a-b).sum()

def swap_groups(df, row1, row2):
    """swap the groups of the people in row1 and row2"""
    r1group = df.loc[row1,'group']
    r2group = df.loc[row2,'group']
    df.loc[row2,'group'] = r1group
    df.loc[row1,'group'] = r2group
    return df

def row_to_group(df, row):
    """get the group associated to a given row"""
    return df.loc[row,'group']

def swap_and_score(df, row1, row2):
    """
    given two rows, calculate the between-group score before
    and after swapping their groups. If the score is reduced
    by swapping, keep the swap and return the df; otherwise
    return the original (swap back)
    """
    #orig = df
    g1 = row_to_group(df,row1)
    g2 = row_to_group(df,row2)
    s1 = group_diff(df,g1,g2)
    df = swap_groups(df, row1, row2)
    s2 = group_diff(df,g1,g2)
    #print(s1,s2)
    if s1>s2:
        #print('swap')
        return df
    else:
        return swap_groups(df, row1, row2)

def pairwise_scores(df):
    d = []
    for i in range(n_groups):
        for j in range(i+1,n_groups):
            d.append(group_diff(df,i,j))
    return d

# the original snippet assumed df, n and n_groups were already defined;
# using the question's test frame for concreteness
df = test.copy()
n = len(df)
n_groups = 3

# one hot encode and copy
df_dum = pd.get_dummies(df, columns=['B', 'C', 'D']).copy(deep=True)

# drop extra cols as needed

groups = assign_groups(n, n_groups)
df_dum['group'] = groups

# iterate: repeatedly pick two random rows and keep the swap only if it lowers the between-group score
for _ in range(5000):
    rows = random.choices(range(n), k=2)
    #print(rows)
    df_dum = swap_and_score(df_dum, rows[0], rows[1])
    #print(pairwise_scores(df_dum))

print(pairwise_scores(df_dum))

df['group'] = df_dum.group
df['orig_groups'] = groups

# show the remaining per-column differences for each pair of groups
for i in range(n_groups):
    for j in range(i + 1, n_groups):
        a = df_dum.loc[df_dum['group'] == i, ~df_dum.columns.isin(('A', 'group'))].sum()
        b = df_dum.loc[df_dum['group'] == j, ~df_dum.columns.isin(('A', 'group'))].sum()
        print(a - b)
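As a sanity check on the result (a sketch assuming the question's test frame supplied the columns), the per-group category counts can be inspected with a pivot like the one asked for in the question:

for col in ('B', 'C', 'D'):
    print(pd.crosstab(df['group'], df[col]))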

I will be changing the question itself to better explain what was needed, since I think I did not explain the end goal particularly well the first time around.
