Question
I'm trying to build a binary decision tree classifier in Python from scratch based on a data set that has only nominal attributes.
One step I'm stuck on is finding all possible ways to compute a binary split of a nominal attribute. For example, for an attribute with possible values [a, b, c, d], I am looking for a way to split these in two arrays such that we obtain:
left right
---- -----
a bcd
b acd
c abd
d abc
ab cd
ac bd
ad bc
without duplicate splits (e.g. we don't need "bc" in left
and "ad" in right
since this would yield the same binary split as "ad" in left
and "bc" in right
). Order within each split is also irrelevant (e.g. "ad" is the same as "da" in one side of a split).
Current Attempt
The exact terminology is escaping me, but I think this is some form of combination/permutation problem. I know its not quite a powerset I'm after. The closest question I could find similar to mine is linked here.
So far I've started an iterative procedure:
for i in range(1, array.length / 2):
for j in range(1, i):
# stuck here
The reason for looping only through the floor of half the length of the attribute's possible values (array
) is because if we store up to array.length / 2
values in left
, right has 1 - (array.length / 2)
values, covering all possible splits.
Also, I've heard of the itertools
library .. so perhaps there's a way to achieve what I'm after with that?