I know the title is a bit ambiguous. Please read on for more detail.
Input
I have a known number of sets (say 10,000) of variable length, each containing only letters of the English alphabet. They look like this:
a = ['a', 'b', 'c', 'a']
b = ['c', 'd', 'a', 'b']
c = ['x', 'y', 'z']
....
unique_value = set((*a, *b, *c, ...))
# {'a', 'b', 'c', 'd', 'e', 'f', ..., 'u', 'v', 'w', 'x', 'y', 'z'}
What I need
I need to choose a fixed number of sets (say 100) from the 10,000 above, such that the chosen sets together contain every English letter and the count of each letter is as balanced as possible. "Balanced" means the letter distribution is close to uniform. I know a perfectly uniform distribution is hard to achieve, so defining a balance criterion is also important.
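To make the idea of balance concrete, here is a minimal sketch of one possible criterion. It is only an assumption for illustration: it uses the coefficient of variation of the 26 letter counts (0 means perfectly uniform). The function names `letter_counts`, `imbalance`, and `covers_alphabet` are hypothetical, and other measures (entropy, max/min ratio) could serve equally well.

```python
from collections import Counter
import string


def letter_counts(chosen):
    """Count how often each letter a-z occurs across the chosen lists."""
    counts = Counter()
    for letters in chosen:
        counts.update(letters)
    return [counts.get(ch, 0) for ch in string.ascii_lowercase]


def imbalance(chosen):
    """Hypothetical criterion: std / mean of the 26 counts (0 = uniform)."""
    counts = letter_counts(chosen)
    mean = sum(counts) / len(counts)
    if mean == 0:
        return float("inf")  # nothing selected, so balance is undefined
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance ** 0.5 / mean


def covers_alphabet(chosen):
    """True if every letter a-z appears at least once in the selection."""
    return all(c > 0 for c in letter_counts(chosen))
```

A selection would then be acceptable if `covers_alphabet(selection)` is true, and better the closer `imbalance(selection)` is to 0.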
My question
- How can I pick such a subset (with the properties above) from the original collection of sets?
- What would be a good definition of a balance criterion?
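For reference, a naive greedy baseline for the two points above might look like the sketch below. It is not claimed to be optimal: at each step it adds the candidate that leaves the fewest letters missing and, among ties, the lowest std / mean of the letter counts. The names `score` and `greedy_pick` and the scoring rule itself are assumptions made for illustration.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase


def score(counts):
    """Return (letters still missing, std / mean of counts); lower is better."""
    values = [counts.get(ch, 0) for ch in ALPHABET]
    missing = sum(1 for v in values if v == 0)
    mean = sum(values) / len(values)
    if mean == 0:
        return (missing, float("inf"))
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return (missing, std / mean)


def greedy_pick(candidates, k):
    """Greedily pick up to k candidate indices; returns (indices, letter counts)."""
    counts = Counter()
    chosen = []
    remaining = set(range(len(candidates)))
    for _ in range(min(k, len(candidates))):
        # Try every remaining candidate and keep the one with the best score.
        best = min(
            sorted(remaining),
            key=lambda i: score(counts + Counter(candidates[i])),
        )
        chosen.append(best)
        remaining.remove(best)
        counts.update(candidates[best])
    return chosen, counts
```

Each step scans all remaining candidates, so picking 100 out of 10,000 costs roughly a million score evaluations. A greedy result like this could also serve as a starting point for local search (swapping one chosen set for an unchosen one when that lowers the score).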
Please suggest a way to achieve this. Any advice would be greatly appreciated.
Thanks in advance!