I am attempting to split a pandas DataFrame into groups. These groups must be as similar to each other as possible with respect to the categorical variables.
For example, I have 10 marbles, and need to make 3 groups. 4 of my marbles are blue, 2 are yellow, 4 are white.
10 marbles will not divide evenly into 3 groups, so the group sizes will be 4, 3, 3, i.e. as close to even as possible.
Likewise, the colors will not have even representation between groups since we only have 2 yellow marbles. However, those yellow marbles must be distributed across groups as evenly as possible. The same requirement applies to every categorical variable in the data set.
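To make "as close to even as possible" concrete, here is a minimal sketch of the arithmetic. The helper name `even_split` is made up for illustration. It only gives the target counts, not which group receives which count.

```python
def even_split(total, parts):
    """Split `total` items into `parts` counts that differ by at most one."""
    base, extra = divmod(total, parts)
    # The first `extra` parts each take one additional item.
    return [base + 1 if i < extra else base for i in range(parts)]

print(even_split(10, 3))  # group sizes for 10 marbles: [4, 3, 3]
print(even_split(2, 3))   # yellow marbles per group:   [1, 1, 0]
print(even_split(4, 3))   # blue (or white) per group:  [2, 1, 1]
```

Note that the per-color targets cannot all put their extra marble in the same group, otherwise the group sizes would no longer be 4, 3, 3.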
My original plan was to check, for each row, whether its values were already present in a group and, if so, try another group. My coworker pointed out a better way: generate groups, score them with one-hot encoding, and then swap rows between groups until the one-hot sums reach similar levels (indicating that each group contains a "close to representative" mix of the categorical variables). His solution is the posted answer.
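For readers who want a feel for the one-hot scoring and swapping idea, here is a rough sketch of how I understand it. This is not the posted answer: the function names `imbalance` and `balance_groups`, the squared-deviation score, and the `max_passes` limit are all my own assumptions.

```python
import numpy as np
import pandas as pd

def imbalance(dummies, labels, n_groups):
    # Sum the one-hot columns within each group, then measure how far
    # the per-group sums are from their column means (0 = identical groups).
    sums = np.vstack([dummies[labels == g].sum(axis=0) for g in range(n_groups)])
    return float(((sums - sums.mean(axis=0)) ** 2).sum())

def balance_groups(df, cols, n_groups, max_passes=20):
    dummies = pd.get_dummies(df[cols]).to_numpy(dtype=float)
    # Round-robin start: group sizes differ by at most one, and swaps keep them fixed.
    labels = np.arange(len(df)) % n_groups
    best = imbalance(dummies, labels, n_groups)
    for _ in range(max_passes):
        improved = False
        for i in range(len(df)):
            for j in range(i + 1, len(df)):
                if labels[i] == labels[j]:
                    continue
                labels[i], labels[j] = labels[j], labels[i]  # try swapping two rows
                score = imbalance(dummies, labels, n_groups)
                if score < best - 1e-12:
                    best, improved = score, True  # keep the swap
                else:
                    labels[i], labels[j] = labels[j], labels[i]  # undo the swap
        if not improved:
            break  # no swap helps any more (a local optimum, not necessarily global)
    return labels

test = pd.DataFrame({'A': ['alice', 'bob', 'george', 'michael', 'john', 'peter', 'paul', 'mary'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat'],
                     'D': ['boy', 'girl', 'boy', 'girl', 'boy', 'girl', 'boy', 'girl']})
labels = balance_groups(test, ['B', 'C', 'D'], 3)
print(test.assign(group=labels))
```

This is deterministic (no random numbers), but it only accepts swaps that improve the score, so it can stop at a grouping that is good rather than the best possible.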
import pandas as pd
import numpy as np

test = pd.DataFrame({'A': ['alice', 'bob', 'george', 'michael', 'john', 'peter', 'paul', 'mary'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat'],
                     'D': ['boy', 'girl', 'boy', 'girl', 'boy', 'girl', 'boy', 'girl']})
gr1, gr2, gr3 = [], [], []
gr1_names = []

def test_check1(x):
    # this is where I'm clearly not approaching this problem correctly
    for index, row in x.iterrows():
        if row['A'] not in gr1 and row['B'] not in gr1 and row['C'] not in gr1 and row['D'] not in gr1:
            gr1.extend(row)  # keep a record of what names are in what groups
            gr1_names.append(row['A'])  # save the name
But writing this out, I realize I also need a fallback: if a row isn't allowed into ANY group, put it in the first one; the next time a row isn't allowed into ANY group, put it in the second one; and so on.
I can see that my sample code does not adequately handle that situation.
I tried a random number generator and then binning the results, and honestly this was pretty close, but I was hoping to find a non-random answer.
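One non-random direction I can imagine is a greedy pass: instead of asking "is this row allowed in a group?", put each row into the open group that already contains the fewest of its values. A row that clashes with every group then simply lands in the least-bad one, which covers the fallback case. This is only a sketch; `greedy_groups` and its tie-breaking rule (fewest clashes, then smallest group, then lowest group number) are assumptions of mine, and the result depends on row order.

```python
from collections import Counter
import pandas as pd

def greedy_groups(df, cols, n_groups):
    base, extra = divmod(len(df), n_groups)
    capacity = [base + (g < extra) for g in range(n_groups)]  # e.g. 3, 3, 2 for 8 rows
    seen = [Counter() for _ in range(n_groups)]  # (column, value) counts per group
    sizes = [0] * n_groups
    labels = []
    for _, row in df.iterrows():
        values = [(c, row[c]) for c in cols]
        open_groups = [g for g in range(n_groups) if sizes[g] < capacity[g]]
        # Fewest clashes first, then the smallest group, then the lowest index.
        g = min(open_groups,
                key=lambda k: (sum(seen[k][v] for v in values), sizes[k], k))
        seen[g].update(values)
        sizes[g] += 1
        labels.append(g)
    return pd.Series(labels, index=df.index, name='group')

test = pd.DataFrame({'A': ['alice', 'bob', 'george', 'michael', 'john', 'peter', 'paul', 'mary'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat'],
                     'D': ['boy', 'girl', 'boy', 'girl', 'boy', 'girl', 'boy', 'girl']})
groups = greedy_groups(test, ['B', 'C', 'D'], 3)
print(test.join(groups))
```

Because each row is placed once and never moved, this will usually be less balanced than a swap-based approach, but it is fully repeatable.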
Here are some links that I found helpful as I worked on this today:
How to get all possible combinations of a list’s elements?
Get unique combinations of elements from a python list
Randomly reassign participants to groups such that participants originally from same group don't end up in same group (this one feels very close, but I can't figure out how to adapt it to what I need)
How to generate lists from a specification of element combinations
Expected output would be a dataframe in any shape, but a pivot of said dataframe would indicate:
group id  foo  bar  faz
1         3    2    5
2         3    2    5
3         3    1    5
4         4    1    5
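A pivot like the one above could be produced with `pd.crosstab` once each row has a group label. The small frame below is made-up illustration data, and the `group` column name is an assumption, not something from my real data set.

```python
import pandas as pd

# Hypothetical result: each row already carries its assigned group id.
df = pd.DataFrame({
    'group': [1, 1, 2, 2, 3, 3],
    'B': ['one', 'two', 'one', 'two', 'one', 'three'],
})

# Rows are group ids, columns are category values, cells are counts.
pivot = pd.crosstab(df['group'], df['B'])
print(pivot)
```

If the groups are well balanced, the counts in each column should differ by at most one from row to row.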