0

I have a dataframe in the following form:

group base height weight size
0      A     10     5     M
0      A     20     5     M
1      A     10     10    S
2      A      5      5    L

How can I get a matrix, which counts the similiar entries by group? The output looks like this:

compare  base height weight size
0,1        3/3  2/3   2/3   2/3
0,2        3/3  0/3   3/3   2/3
1,2        2/2  0/2   0/2   0/2
PV8
  • 5,799
  • 7
  • 43
  • 87

1 Answers1

1
  • at the core of the solution most frequent values
  • use itertools.combinations to get valid pairs of groups
  • compare most frequent to each row in combination of groups. sum() truth matrix to find number of matches
  • rest is formating
df = pd.read_csv(io.StringIO("""group base height weight size
0      A     10     5     M
0      A     20     5     M
1      A     10     10    S
2      A      5      5    L"""), sep="\s+")

# columns we're working with
cols = [c for c in df.columns if c!= "group"]

# iterate over combinations of groups
dfx = pd.DataFrame()
for gp in itertools.combinations(df.group.unique(), 2):
    dfg = df.loc[df.group.isin(gp),cols]
    dfx = pd.concat([dfx, 
                     (dfg == dfg.value_counts().index[0])
                     .sum().to_frame().T.assign(gs=len(dfg), compare=",".join(str(e) for e in gp))
                    ])
# rebase 1 as 0 for comparisons
dfx = dfx.reset_index(drop=True).replace(1,0).astype(str)
# format as required
dfx.loc[:,cols] = dfx[cols].apply(lambda x: x+" / " +dfx["gs"])

dfx.drop(columns="gs")

base height weight size compare
0 3 / 3 2 / 3 2 / 3 2 / 3 0,1
1 3 / 3 0 / 3 3 / 3 0 / 3 0,2
2 2 / 2 0 / 2 0 / 2 0 / 2 1,2
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30