
I know the title is a bit ambiguous; please read on for the details.

Input

I have a known number of sets (say 10,000) of variable length; each is a subset of the English alphabet (stored as a list, possibly with repeated characters). They look like this:

a = ['a', 'b', 'c', 'a']
b = ['c', 'd', 'a', 'b']
c = ['x', 'y', 'z']
....

unique_value = set((*a, *b, *c, ...))
# {'a', 'b', 'c', 'd', 'e', 'f', ..., 'u', 'v', 'w', 'x', 'y', 'z'}

What I need

I need to choose a fixed number of sets (say 100) out of those 10,000, such that together they contain all English characters and the count of each character is as balanced as possible. By balanced I mean the character distribution is close to uniform. I know it's hard to pick a perfectly uniform distribution, so defining a balance criterion is also important.
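For example, I could imagine measuring balance as the spread of per-letter counts over the whole alphabet (0 would mean perfectly uniform), though I'm not sure this is the right metric; `balance_score` below is just a sketch of that idea:

from collections import Counter
from statistics import pstdev
import string

def balance_score(chosen):
    # Lower is better; 0 means every letter occurs equally often.
    counts = Counter(ch for subset in chosen for ch in subset)
    # letters that never occur count as 0, so missing coverage is penalized
    return pstdev(counts[ch] for ch in string.ascii_lowercase)

balance_score([a, b, c]) # e.g. score the three example sets above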

My question

  1. How do I pick such a subset (with the properties above) from the original sets?
  2. How should the balance criterion be defined?

Please suggest a way to achieve this. Any advice will be greatly appreciated.
Thanks in advance!

enamoria

1 Answer


The general algorithm I would try is a probabilistic one. I would create a reverse lookup table from characters to subset_ids, then repeatedly add and remove subsets so the selection hovers at the fixed number of subsets or one above it. When adding, I'd pick a randomly selected subset that contains the least populated letter, and when removing I'd select from the subsets containing the most populated letter. There should also be a small chance of "mutating", i.e. adding or removing a completely random subset, to prevent getting stuck in a local minimum.

I tried to code this solution, but it quickly degraded into spaghetti code as I fixed edge cases and bugs. It's far from a polished solution and may even return the wrong answer, but it might at least give you some ideas.

# Assumes subsets is your list of character lists, e.g. subsets = [a, b, c, ...]
from collections import defaultdict
from random import choice, random
from statistics import pstdev

# Make reverse lookup table: character -> ids of the subsets containing it
lookup = defaultdict(set)
for idx, subset in enumerate(subsets):
    for character in subset:
        lookup[character].add(idx)

best_score, best_subsets = float('inf'), None # best (lowest) spread seen so far
size = 10 # number of subsets to pick
subset_indices = set() # subset_ids
character_subsets = defaultdict(set) # subset_ids per letter
# loop some large number of times
for _ in range(10000):
    if len(subset_indices) > size: # too many: remove one
        idx = choice(list(subset_indices)) # default: random "mutation" removal (10% case)
        if random() < 0.9: # 90% chance: remove a subset containing the most populated letter
            indices = max(character_subsets.values(), key=len) # subset_ids for that letter
            idx = choice(list(indices)) # pick one of them
        for character in set(subsets[idx]): # set() so a repeated letter isn't removed twice
            character_subsets[character].remove(idx) # drop subset_id from each letter's pool
        subset_indices.remove(idx) # remove subset_id from random draw pool
    else: # add a new subset
        idx = choice(list(set(range(len(subsets))) - subset_indices)) # default: random unchosen subset (10% case)
        if random() < 0.9: # 90% chance: add a subset containing the least populated letter
            # least populated letter; before anything is tracked, fall back to a random letter
            i, indices = min(character_subsets.items(), key=lambda x: len(x[1]),
                             default=(choice(list(lookup)), set()))
            indices = lookup[i] - indices # subsets containing i that aren't already chosen
            if not indices: continue # abort if empty
            idx = choice(list(indices)) # pick
        for character in subsets[idx]:
            character_subsets[character].add(idx) # update dict
        subset_indices.add(idx) # update random selection set
    score = pstdev(map(len, character_subsets.values())) # measure distribution
    if score < best_score and len(subset_indices) == size: # if better
        best_subsets = dict(character_subsets) # record it
        best_score = score

# pretty-print or process best_subsets however you like
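For example, to pull the actual selection back out of `best_subsets` (assuming the loop found at least one full-size selection, so it isn't None), something like this should work:

chosen_ids = set().union(*best_subsets.values()) # every subset_id that was selected
print("chosen subsets:", sorted(chosen_ids))
for letter, ids in sorted(best_subsets.items()): # per-letter counts
    print(letter, len(ids))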
Dillon Davis
  • Sorry for late response. Can you elaborate this part? `When adding subsets, I'd add a randomly selected subset that contains the least populated letter, and when removing I'd select from the subsets containing the most populated letter` – enamoria Mar 05 '19 at 03:20
  • 1
    @enamoria You want to balance the # of occurrences of each letter, so when we add a new subset, we pick one containing the letter with least occurrences. However, to "shuffle" things around a bit hoping it'll settle its way into a global minimum, we also remove a subset (to make room for adding another). This subset that we remove should be one containing the letter with most occurrences, to bring it down and hopefully balance things a bit. – Dillon Davis Mar 05 '19 at 03:28
  • 1
    I drew ideas from [walksat](https://cseweb.ucsd.edu/~elkan/150/oct25.html) when crafting this algorithm, if that helps give you any intuition behind my reasoning. – Dillon Davis Mar 05 '19 at 03:30