3

I want to generate every distinct combination of splitting the lower-case alphabet into two sets of six letters and two sets of seven letters. The order of the letters within the sets doesn't matter, i.e. if two solutions differ only in the order of the letters within the sets, then those solutions are identical.

I.e. these two solutions are identical:

[a,b,c,d,e,f][g,h,i,j,k,l][m,n,o,p,q,r,s][t,u,v,w,x,y,z]
[f,b,c,d,e,a][l,h,i,j,k,g][s,n,o,p,q,r,m][z,u,v,w,x,y,t]

A naive approach might be to generate every permutation of the 26 letters plus 2 dummies, split these evenly into four groups and discard duplication solutions (and ignore the dummies when I use the data). But this seems rather inefficient. I'm sure there's a well-known algorithm out there to do this, but I'm struggling to search for this given the wide range of similar but different permutation/combination problems out there.

Is there an existing, named algorithm that can split nk elements into n sets of k elements, generating every combination of those sets? If not, I'll have to hack something together myself. But this feels like a problem that's already been solved.

Duncan Jones
  • 67,400
  • 29
  • 193
  • 254
  • 1
    This looks like it might be related: http://stackoverflow.com/q/1622574/474189, although I was rather hoping for a reference to a standard algorithm for solving this. – Duncan Jones Oct 21 '14 at 20:11
  • It seems like you could get some improvement by turning the problem on its head: Each letter has to go into one of four buckets, and the buckets have limited space, so recursively try putting each letter into each bucket that has room for it. This way you're only producing combinations rather than permutations. – StriplingWarrior Oct 21 '14 at 20:14
  • Thanks for the advice @StriplingWarrior, I'll give that some thought. Meanwhile, another post catches my eye: http://codereview.stackexchange.com/questions/12262/create-a-list-of-all-possible-combinations-of-elements-into-n-groups-from-a-set. – Duncan Jones Oct 21 '14 at 20:17
  • 1
    Do you realize there are in the order of `10^12` of such distinct combinations? Do you really want them all? – Vincent van der Weele Oct 21 '14 at 20:59
  • Just out of curiosity, how do you intend to use these combinations? There's a good chance that there's a better way to solve the problem you're trying to solve. – StriplingWarrior Oct 21 '14 at 21:33
  • @StriplingWarrior I'm analysing the most effective distribution of keys on a four key keyboard. I have a test that determines the number of clashes in a predictive text system, based on a key layout. In a perfect world, I'd slowly crunch through all the available combinations to see which is best. Other approaches I've considered involve making educated guesses based on letter frequencies. – Duncan Jones Oct 22 '14 at 05:39
  • @VincentvanderWeele No, I hadn't realised there would be that many! – Duncan Jones Oct 22 '14 at 05:40
  • I'm pretty sure that you could use pure logic to solve this without any guessing, and without combing through all possible combinations. What comes to mind is a weighted graph that keeps track of frequency and average distances between letters as you feed data into it. I'm not super-strong on graph theory, but it seems like you could use a 2D mapping algorithm then to come up with the optimal placement based on that information. – StriplingWarrior Oct 22 '14 at 16:18
  • @StriplingWarrior That's probably a few more intelligent approach. Thanks for the pointer. I'll accept your answer below, since you've addressed the original question. – Duncan Jones Oct 22 '14 at 17:05

2 Answers2

4

I don't know of any algorithm names for this (though one probably exists), but the approach I mentioned in the comments avoids dealing with duplicates and is as efficient as I imagine you can get.

It seems like you could get some improvement by turning the problem on its head: Each letter has to go into one of four buckets, and the buckets have limited space, so recursively try putting each letter into each bucket that has room for it. This way you're only producing combinations rather than permutations.

Here's a C# implementation. It can generate 10,000,000 combinations in under 30 seconds, and 2/3 of that time is spent just in building the string outputs:

void Main()
{
    // Tweak these starting values to create smaller subsets if you want.
    var letters = Enumerable.Range(0, 26).Select(i => (char)('a' + i)).ToList();
    var buckets = new[]{new Bucket(6), new Bucket(6), new Bucket(7), new Bucket(7)};
    // I'm only taking 100 values because otherwise this would take a really long time.
    var combos = Combos(letters, 0, buckets).Take(100);
    foreach (var combo in combos)
    {
        Console.WriteLine(combo);
    }
}

public class Bucket : List<char>
{
    public int MaxLoad {get; private set;}
    public Bucket(int capacity) : base(capacity)
    {
        MaxLoad = capacity;
    }
}

// Define other methods and classes here
IEnumerable<string> Combos(IList<char> letters, int currentIndex, Bucket[] buckets)
{
    if(currentIndex == letters.Count){
        yield return string.Join("|", buckets.Select(b => string.Join(",", b)));
        yield break;
    }
    var currentLetter = letters[currentIndex];
    foreach (var bucket in buckets)
    {
        if(bucket.Count < bucket.Capacity)
        {
            bucket.Add(currentLetter);
            foreach (var possibility in Combos(letters, currentIndex + 1, buckets))
            {
                yield return possibility;
            }
            bucket.Remove(currentLetter);
        }
    }
}

Sample output:

a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,s|t,u,v,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,t|s,u,v,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,u|s,t,v,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,v|s,t,u,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,w|s,t,u,v,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,x|s,t,u,v,w,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,y|s,t,u,v,w,x,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,r,z|s,t,u,v,w,x,y
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,s,t|r,u,v,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,s,u|r,t,v,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,s,v|r,t,u,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,s,w|r,t,u,v,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,s,x|r,t,u,v,w,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,s,y|r,t,u,v,w,x,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,s,z|r,t,u,v,w,x,y
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,t,u|r,s,v,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,t,v|r,s,u,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,t,w|r,s,u,v,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,t,x|r,s,u,v,w,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,t,y|r,s,u,v,w,x,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,t,z|r,s,u,v,w,x,y
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,u,v|r,s,t,w,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,u,w|r,s,t,v,x,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,u,x|r,s,t,v,w,y,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,u,y|r,s,t,v,w,x,z
a,b,c,d,e,f|g,h,i,j,k,l|m,n,o,p,q,u,z|r,s,t,v,w,x,y
...

One nice feature of the approach I've given is that you can process the results as they are produced--you don't have to wait for the whole list to be generated, and you don't need to have all the combinations in memory at the same time.

But be aware that you're going to end up with a lot of combinations--quite possibly more than the computer can generate in any reasonable amount of time regardless of algorithmic efficiency. If Vincent's estimate of 10^12 is correct, for example, it would take roughly a year using the code above. You might be able to optimize it down to a month or so. Parallelization might take it down to a week on a really strong computer.

StriplingWarrior
  • 151,543
  • 27
  • 246
  • 315
  • +1 I appreciate you taking the time to code this up. It certainly seems like an elegant solution. Perhaps there isn't a well-worn algorithm for this - there are so many similar, but different permutation/combination problems that perhaps many are just unsolved. I agree that this seems very efficient - I like how it can cope with "buckets" of different sizes. – Duncan Jones Oct 22 '14 at 07:07
  • Very elegant solution. I've adapted to return a List of integers, avoiding the slowness of formatting strings. – Jone Polvora May 29 '17 at 23:44
0

This is a recursion problem.

If i wanted to find the list of all sets of length n containing some letter the easiest way to think about this is the list of all sets of length n-1 that do not contain this letter concatonated with the set of [letter] for each one, to avoid duplicates you discard all elements you have previously done

For example if i wanted to find the number of two letter combinations in the set [A-F] the answer is to take each element and find the combinations of that. So say i want to find all the combinations that contain A that would then be [A][B-F] then say i want to find all the combinations that contain B but not A to continue this could be [B][C-F] Doing this for all the letters a-f will give you all possible combinations of two letters A-F so now that combination becomes your list tail for the three letter combinations.

You would add a to all the two letter combinations that do not contain A, then you would add b to all the two letter combinations that do not contain a or b, and continue this to get all the three letter combinations.

You can continue this algorithm to have as many levels as you want, and it will find the combinations of all elements of a set of a given length

I know you aren't looking for code, but here is a c# implementation

public IList<string> Combos(IList<string> elements, int level)
    {
        if (level == 1)
        {
            return elements;
        }
        var combinations = new List<string>();
        var previousCombos = Combos(elements, level - 1);
        for (var i = 0; i < elements.Count; i++)
        {
            previousCombos.ToList().ForEach(item =>
            {
                if (!elements.Take(i+1).Any(item.Contains))
                {
                    combinations.Add(item + elements[i]);
                }
            });
        }
        return combinations;
    }

Just a word of warning this is incredibly inefficient, in fact i think it is an exponential algorithm, so don't use it on large data sets or sizes, or your computation will take forever.

theDarse
  • 749
  • 5
  • 13