
How do we recode a set of strictly increasing (or strictly decreasing) positive integers P so as to decrease the number of positive integers that can occur between members of our set?

Why would we want to do this? Say we want to randomly sample P, but 1) P is too large to enumerate, and 2) members of P are related in a nonrandom way that is too complicated to sample by. However, we know a member of P when we see it. Say we know P[0] and P[n] but can't entertain enumerating all of P or understanding precisely how its members are related. Likewise, the number of possible integers occurring between P[0] and P[n] is many times greater than the size of P, making the chance of randomly drawing a member of P very small.

Example: Let P[0] = 2101010101 and P[n] = 505050505. Now, maybe we're only interested in integers between P[0] and P[n] that have a specific quality (e.g. the values encoded in P[x] sum to Q or less, or each member of P has 7 or less as its largest encoded value). So, not all positive integers P[n] <= X <= P[0] belong to P. The P I'm interested in is discussed in the comments below.

What I've tried: If P is a strictly decreasing set and we know P[0] and P[n], then we can treat each member as if it were subtracted from P[0]. Doing so decreases each number, perhaps greatly, and keeps each member a unique integer. For the P I'm interested in (below), one can then treat each decreased value as being divided by a common denominator (9, 11, 99), which decreases the number of possible integers between members of P. I've found that, used in conjunction, these approaches shrink the set of all P[n] <= X <= P[0] by a few orders of magnitude, but the chance of randomly drawing a member of P from all positive integers P[n] <= X <= P[0] is still very small.
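
A minimal sketch of those two steps (subtract from P[0], then divide by a common divisor), using the Q = 20, N = 4 numbers that come up in the comments below; `recode` and `decode` are names I've made up for illustration:

```python
# Hypothetical sketch: P[0] = 17010101 encodes [17,1,1,1],
# P[n] = 5050505 encodes [5,5,5,5], and 99 is the common divisor.
P0, PN, DIV = 17010101, 5050505, 99

def recode(x):
    # subtract from the largest member, then divide by the common divisor
    return (P0 - x) // DIV

def decode(r):
    # invert the recoding: multiply back up and subtract from P[0]
    return P0 - r * DIV

assert recode(PN) == 120804              # recoded range is 99x smaller
assert decode(recode(14030201)) == 14030201  # round trip is lossless
```

The point is that drawing from [0, 120804] is far cheaper than drawing from [5050505, 17010101], while the round trip loses nothing.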

Note: As should be clear, we have to know something about P. If we don't, that basically means we have no clue what we're looking for. When we randomly sample integers between P[0] and P[n] (recoded or not), we need to be able to say "Yup, that belongs to P," if indeed it does.

A good answer could greatly increase the practical application of a computing algorithm I have developed. An example of the kind of P I'm interested in is given in comment 2. I am adamant about giving due credit.

klocey
  • Do you have the ability to map k to P[k]? How do you know what P[k] is? Some of this may depend on what these P[k] actually are and how they are represented. – mhum Feb 16 '13 at 00:02
  • Each member in the set P corresponds to an integer partition of Q with N parts, e.g. let Q = 20 and N = 4; then the first lexical partition for Q and N, i.e. [17,1,1,1], maps to P as 17010101 (i.e. P[0]). Consequently, the integer partition [5,5,5,5] corresponds to 5050505. The point? Randomly sample from the set of all integer partitions of Q having N parts without recursion and with a low rejection rate. There, I gave it all away. It's worth it for the sake of an answer. – klocey Feb 16 '13 at 00:11
  • Might as well add some encouragement: This method is blindingly fast when N is less than 5. By blindingly fast, I mean it makes an event with a probability so small that my computer won't calculate it (e.g. using typical random partitioning algorithms) become very likely. – klocey Feb 16 '13 at 00:14
  • Ok. Can we assume that zeros are not allowed and order doesn't matter? I think that's usual in integer partitions but I'm not sure about your case. If so, can you confirm that you're looking to uniformly sample partitions with exactly N parts? – mhum Feb 16 '13 at 00:44
  • Padding with zeros (i.e. 17,1,1,1 -> 17010101) clearly increases the size of the numbers, greatly. However, not padding with zeros (i.e. 17,1,1,1 -> 17111) doesn't allow for (P[0] > P[1] > ...), and while it's easy to go from 17,1,1,1 to 17111, it can be hard or time-consuming to randomly draw a number (e.g. 12332) and then know it decodes to 12,3,3,2 ... especially if, say, Q = 50000 and N = 953. In short, the entire process is 1) draw a number at random in the range P[0] to P[n]; 2) decode it and see whether the resulting sequence belongs to P(Q,N), i.e., the set of partitions of Q with N parts. – klocey Feb 16 '13 at 01:07
  • I'm sorry, I wasn't clear. Can we assume that [20,0,0,0] is not a valid partition of 20? And can we assume that [17,1,1,1] is the same as [1,1,17,1]? – mhum Feb 16 '13 at 01:08
  • So [20,0,0,0] would not be a valid partition. Also, in this case, order is important because every partition has to have the same chance of being drawn, and some lexically ordered partitions of Q and N have different numbers of unordered sequences (call'em microstates), which would make some integer partitions more likely than others. – klocey Feb 16 '13 at 01:14
  • By treating order as distinct, you want to sample a partition containing the elements {1,2,3,14} 24 times more frequently than a partition containing the elements {5,5,5,5}, correct? Is this where the standard partition sampling algorithms have failed you? – mhum Feb 16 '13 at 01:22
  • Standard partitioning algorithms don't sample w/respect to N and hence, require high rejection rates. I derived an algorithm that does sample according to Q&N (http://goo.gl/7C3L2), but like other algorithms it relies on recursion. Recursion is the problem because 1.) it can take a long time and 2.) it can easily max-out recursion limits and stack size for reasonable Q and N. The solution we're discussing is nonrecursive but gets held up by having many unusable numbers between P[0] and P[n]. – klocey Feb 16 '13 at 01:33
  • To be clearer on your comment, and without talking about recoding integers, the partition [14,3,2,1] can only be represented by the number 14030201 or 14321 or something similarly unique. Reason being that [14,3,2,1] can't be allowed to have a higher probability of being drawn than [5,5,5,5] or [17,1,1,1]. – klocey Feb 16 '13 at 01:43
  • Ok. Understood. I take it that the above [linked solution](http://goo.gl/7C3L2) is too slow for you and/or smashes the stack? I wouldn't have expected that, but I will take your word for it. I would have expected the recursion depth to be no more than N, but I guess not. – mhum Feb 16 '13 at 01:50
  • Right, so the linked solution presents an elegant algorithm but the implementation of it uses recursion. I'd have to set the recursion limit pretty high for Q=3000 and N=200. Even if it worked, it would take a very long time. I've found other ways to get around recursion using very simple approaches (e.g. goo.gl/t9D5s), but the one we're discussing is a challenge. – klocey Feb 16 '13 at 02:12
  • If what you are asking for is to give an efficient algorithm for uniformly sampling integer partitions of fixed length, then there is no known solution. – PengOne Feb 16 '13 at 04:31
  • @PengOne I've derived 2 algorithms for generating uniform random integer partitions with respect to a total Q and number of parts N. One here: dx.doi.org/10.6084/m9.figshare.156290. The other is the one in this string of comments. The concept behind each is simple, and each is fast in its own way. The approach discussed here does not rely on recursion, and is fast for combinations of Q and N that can't be dealt with using other algorithms. **But, the actual question is about recoding integers, not random integer partitioning.** Thanks though. – klocey Feb 16 '13 at 05:31
  • Well we have to know *something* about P. You clearly can't do this for arbitrary P, as there is no way to reject any integer without evaluating P on it first. – Keith Randall Feb 16 '13 at 07:12
  • Hi Keith. I refer you to the second comment in the string. It's explicit. However, the answer doesn't have to apply to my P. Feel free to choose P as long as it fits the conditions in the body of the question. – klocey Feb 16 '13 at 08:19
  • Can you explain why you can't simply store the integer partition in an array or container of some sort and then simply use a pseudo random number generator to randomly choose one? I don't understand how you are sampling the data and why there are rejects. Can you edit your question and show an example of how you would like to sample the data? – Bob Bryan Feb 16 '13 at 15:05
  • I want to keep the question from being confused with one of integer partitioning. I used integer partitioning as an example of how we might know something about P but still have the following: 1) P is a set of integers with differences between integers being >1 (e.g. P[0] = 480101, P[1] = 470201); 2) P is too large to hold & takes too long to examine; 3) We don't know the precise relationship between members of P, i.e. we can't say what the 10^9th member of P is even though P obeys rules (comment 2). But we can treat P as being recoded in a way that decreases differences between members. – klocey Feb 16 '13 at 17:16
  • @BobBryan there is a rejection rate because P is much smaller than the set of all positive integers that are <= P[0] and >= P[n], and we are forced to randomly draw from that larger set to obtain a uniform random sample of P. Treating P as being recoded can increase the chance of randomly drawing a member of P (see body of question), reducing the rejection rate, i.e. the frequency with which we draw a number and say 'Crap, that doesn't belong to P.' – klocey Feb 16 '13 at 18:04

2 Answers


While the original question asks about a very generic scenario concerning integer encodings, I would suggest that it is unlikely there exists an approach that works in complete generality. For example, if the P[i] are more or less random (from an information-theoretic standpoint), I would be surprised if anything worked.

So, instead, let us turn our attention to the OP's actual problem of generating partitions of an integer N containing exactly K parts. When encoding combinatorial objects as integers, it behooves us to preserve as much of the combinatorial structure as possible. For this, we turn to the classic text Combinatorial Algorithms by Nijenhuis and Wilf, specifically Chapter 13. In that chapter, they demonstrate a framework to enumerate and sample from a number of combinatorial families -- including partitions of N where the largest part is equal to K. Using the well-known duality between partitions with K parts and partitions where the largest part is K (take the transpose of the Ferrers diagram), we find that we only need to change the decoding process.
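
That duality is small enough to sketch directly. The helper below is illustrative only (my own code, not from Nijenhuis and Wilf); the source code that follows folds the transpose into its path-decoding step instead:

```python
def conjugate(partition):
    # Transpose the Ferrers diagram: the j-th part of the conjugate is
    # the number of parts of the original partition that are >= j.
    result = []
    j = 1
    while True:
        count = sum(1 for p in partition if p >= j)
        if count == 0:
            return result
        result.append(count)
        j += 1

# A partition of 20 whose largest part is 5 maps to one with exactly 5 parts:
assert conjugate([5, 5, 4, 3, 3]) == [5, 5, 5, 3, 2]
assert len(conjugate([5, 5, 4, 3, 3])) == 5
```

Since conjugation is an involution, sampling uniformly from partitions with largest part K and transposing gives a uniform sample of partitions with exactly K parts.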

Anyways, here's some source code:

import sys
import random
import time

if len(sys.argv) < 4 :
    sys.stderr.write("Usage: {0} N K iter\n".format(sys.argv[0]))
    sys.stderr.write("\tN = number to be partitioned\n")
    sys.stderr.write("\tK = number of parts\n")
    sys.stderr.write("\titer = number of iterations (if iter=0, enumerate all partitions)\n")
    quit()

N = int(sys.argv[1])
K = int(sys.argv[2])
iters = int(sys.argv[3])

if (N < K) :
    sys.stderr.write("Error: N<K ({0}<{1})\n".format(N,K))
    quit()

# B[n][k] = number of partitions of n with largest part equal to k
B = [[0 for j in range(K+1)] for i in range(N+1)] 

def calc_B(n,k) :
    for j in xrange(1,k+1) :
        for m in xrange(j, n+1) :
            if j == 1 :
                B[m][j] = 1
            elif m - j > 0 :
                B[m][j] = B[m-1][j-1] + B[m-j][j]
            else :
                B[m][j] = B[m-1][j-1]

def generate(n,k,r=None) :
    path = []
    append = path.append

    # Invalid input
    if n < k or n == 0 or k == 0: 
        return []

    # Pick a random number between 1 and B[n][k] if r is not specified
    if r is None :
        r = random.randrange(1,B[n][k]+1)

    # Construct path from r    
    while r > 0 :
        if n==1 and k== 1:
            append('N')
            r = 0   ### Finish loop
        elif r <= B[n-k][k] and B[n-k][k] > 0  : # East/West Move
            append('E')
            n = n-k
        else : #  Northeast/Southwest move
            append('N')
            r -= B[n-k][k]
            n = n-1
            k = k-1

    # Decode path into partition    
    partition = []
    l = 0
    d = 0    
    append = partition.append    
    for i in reversed(path) :
        if i == 'N' :
            if d > 0 : # apply East moves all at once
                for j in xrange(l) :
                    partition[j] += d
            d = 0  # reset East moves
            append(1) # apply North move
            l += 1            
        else :
            d += 1 # accumulate East moves    
    if d > 0 : # apply any remaining East moves
        for j in xrange(l) :
            partition[j] += d

    return partition


t = time.clock()
sys.stderr.write("Generating B table... ")    
calc_B(N, K)
sys.stderr.write("Done ({0} seconds)\n".format(time.clock()-t))

bmax = B[N][K]
Bits = 0
sys.stderr.write("B[{0}][{1}]: {2}\t".format(N,K,bmax))
while bmax > 1 :
    bmax //= 2
    Bits += 1
sys.stderr.write("Bits: {0}\n".format(Bits))

if iters == 0 : # enumerate all partitions
    for i in xrange(1,B[N][K]+1) :
        print i,"\t",generate(N,K,i)

else : # generate random partitions
    t=time.clock()
    for i in xrange(1,iters+1) :
        Q = generate(N,K)
        print Q
        if i%1000==0 :
            sys.stderr.write("{0} written ({1:.3f} seconds)\r".format(i,time.clock()-t))

    sys.stderr.write("{0} written ({1:.3f} seconds total) ({2:.3f} iterations per second)\n".format(i, time.clock()-t, float(i)/(time.clock()-t) if time.clock()-t else 0))

And here's some examples of the performance (on a MacBook Pro 8.3, 2GHz i7, 4 GB, Mac OSX 10.6.3, Python 2.6.1):

mhum$ python part.py 20 5 10
Generating B table... Done (6.7e-05 seconds)
B[20][5]: 84    Bits: 6
[7, 6, 5, 1, 1]
[6, 6, 5, 2, 1]
[5, 5, 4, 3, 3]
[7, 4, 3, 3, 3]
[7, 5, 5, 2, 1]
[8, 6, 4, 1, 1]
[5, 4, 4, 4, 3]
[6, 5, 4, 3, 2]
[8, 6, 4, 1, 1]
[10, 4, 2, 2, 2]
10 written (0.000 seconds total) (37174.721 iterations per second)

mhum$ python part.py 20 5 1000000 > /dev/null
Generating B table... Done (5.9e-05 seconds)
B[20][5]: 84    Bits: 6
100000 written (2.013 seconds total) (49665.478 iterations per second)

mhum$ python part.py 200 25 100000 > /dev/null
Generating B table... Done (0.002296 seconds)
B[200][25]: 147151784574    Bits: 37
100000 written (8.342 seconds total) (11987.843 iterations per second)

mhum$ python part.py 3000 200 100000 > /dev/null
Generating B table... Done (0.313318 seconds)
B[3000][200]: 3297770929953648704695235165404132029244952980206369173   Bits: 181
100000 written (59.448 seconds total) (1682.135 iterations per second)

mhum$ python part.py 5000 2000 100000 > /dev/null
Generating B table... Done (4.829086 seconds)
B[5000][2000]: 496025142797537184410324290349759736884515893324969819660    Bits: 188
100000 written (255.328 seconds total) (391.653 iterations per second)

mhum$ python part-final2.py 20 3 0
Generating B table... Done (0.0 seconds)
B[20][3]: 33    Bits: 5
1   [7, 7, 6]
2   [8, 6, 6]
3   [8, 7, 5]
4   [9, 6, 5]
5   [10, 5, 5]
6   [8, 8, 4]
7   [9, 7, 4]
8   [10, 6, 4]
9   [11, 5, 4]
10  [12, 4, 4]
11  [9, 8, 3]
12  [10, 7, 3]
13  [11, 6, 3]
14  [12, 5, 3]
15  [13, 4, 3]
16  [14, 3, 3]
17  [9, 9, 2]
18  [10, 8, 2]
19  [11, 7, 2]
20  [12, 6, 2]
21  [13, 5, 2]
22  [14, 4, 2]
23  [15, 3, 2]
24  [16, 2, 2]
25  [10, 9, 1]
26  [11, 8, 1]
27  [12, 7, 1]
28  [13, 6, 1]
29  [14, 5, 1]
30  [15, 4, 1]
31  [16, 3, 1]
32  [17, 2, 1]
33  [18, 1, 1]

I'll leave it to the OP to verify that this code indeed generates partitions according to the desired (uniform) distribution.

EDIT: Added an example of the enumeration functionality.

mhum
  • I compared the output of the function to that of a function known to be unbiased but slower. I generated random samples from each function and compared the distribution of statistical features (skewness, evenness, median summand value) as in goo.gl/uNWR4. If the function were unbiased, the resulting kernel density curves would not reveal any systematic differences. As it is, they are close but different. So, a bias has crept in. I will spend tomorrow studying what you've done (which is very cool) and trying to find where the bias is. I can email/post the output/results/script if you like. – klocey Feb 18 '13 at 03:11
  • Ok. I only tested a few cases where N & K were small enough that I could obtain decent samples for all possible partitions. Can you describe the discrepancies? And for which N & K you were testing? I'll take another pass at this and try to debug. – mhum Feb 18 '13 at 03:18
  • I used N={20,40} & K={3,5,10}, small enough to examine distributional features of feasible sets (i.e. all partitions for N&K), which I compared using kernel density curves to random samples generated w/ your method. Statistical evenness was higher than it should be. Skewness was often lower than it should be. The median summand value showed multi-modality but should be unimodal. Kdensity curves for random samples were in high agreement with each other but not with the feasible sets. – klocey Feb 18 '13 at 09:07
  • Hmm. I rechecked and was unable to find any bugs. Are you sure about your statistics? For example, median summand is not always unimodal. For N=40, K=10, there are 791 partitions with median 2, 422 with 2.5, and 922 with 3. I've edited the generate() function to provide an enumeration mode so you can verify that it provides a bijection between partitions and the integers 1..B[N][K]. – mhum Feb 18 '13 at 18:40
  • I dunno. Kdensity curves for feasible sets & random samples should be in high agreement. Other much slower algorithms show high agreement when comparing kdensity curves (modality influenced by kernel width). Can there be an inherent bias in what you're trying to do? My comparisons are based on code used in my preprint on figshare where kdensity curves for random samples using 2 different algorithms are indistinguishable & results show no bias in comparison to feasible sets. Maybe I implemented your code wrong? I can post the necessary functions for you to do comparisons. – klocey Feb 18 '13 at 21:17
  • I guess I don't quite understand your comparison method. For example, why are you using kernel density estimates when for such small N and K, you can explicitly enumerate all partitions and compute these statistics (median, skew, etc..) directly? In any case, you can verify in my updated code that for each r, generate(N,K,r) produces a unique partition of N in K parts (and all such partitions are represented), thus sampling uniformly from r should sample uniformly from all such partitions. – mhum Feb 18 '13 at 21:27
  • Use of kdensity curves may not be clear. This might help: 1) find evenness, skewness, etc. for the feasible set(i.e. every partition for N&S). 2) generate a frequency distribution (kdensity) revealing evenness, skewness, etc. across the feasible set. 3) compare the kdensity curve for the feasible set to kdensity curves for random samples, each composed of many random generated partitions. If random samples are unbiased, kdensity curves for them should lie on that for the feasible set. The approach provides a visual comparison and reveals where/how a method is failing. – klocey Feb 18 '13 at 21:59
  • Ok. The part I don't think I understand is why you would use a kernel density estimate of the frequency distribution instead of the frequency distribution itself? I didn't think you'd need to estimate anything since the exact distribution is known via enumeration. In fact, for small N & K, you may not need to compare these statistics. You can just check for uniformity in the number of each partition generated (say, by a chi-square test). E.g.: if you generate 33000 partitions for N=20, K=3, you should expect to see about 1000 of each partition. – mhum Feb 18 '13 at 22:13
  • Yes, you're totally right. However, I use kdensity curves because usually, I'm comparing hundreds of frequency distributions (random algorithm 1) to hundreds of others (random algorithm 2) and looking for systematic directional bias or weird modal behavior...and often for larger N & K. So, obvious bias in kdensity curves suggests something's wrong and maybe how/where it's wrong. But, high agreement doesn't necessarily mean everything's right...which is why I look at multiple statistical features (evenness, skew, median summand, Gini's coefficient, etc.). – klocey Feb 18 '13 at 22:27

Below is a script that accomplishes what I've asked, as far as recoding integers that represent integer partitions of N with K parts. A better recoding method is needed for this approach to be practical for K > 4. This is definitely not the best or preferred approach. However, it's conceptually simple and easily argued to be fundamentally unbiased. It's also very fast for small K. The script runs fine in a Sage notebook and does not call Sage functions. It is NOT a script for random sampling. Random sampling per se is not the problem.

The method:

1.) Treat integer partitions as if their summands are concatenated together and padded with zeros according to the size of the largest summand in the first lexical partition, e.g. [17,1,1,1] -> 17010101 & [5,5,5,5] -> 05050505

2.) Treat the resulting integers as if they are subtracted from the largest integer (i.e. the int representing the first lexical partition). e.g. 17010101 - 5050505 = 11959596

3.) Treat each resulting decreased integer as divided by a common denominator, e.g. 11959596/99 = 120804

So, if we wanted to choose a random partition we would:

1.) Choose a number between 0 and 120,804 (instead of a number between 5,050,505 and 17,010,101)

2.) Multiply the number by 99 and subtract the result from 17010101

3.) Split the resulting integer according to how we treated each integer as being padded with 0's
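
Put together, those three steps might look like the sketch below for the Q = 20, K = 4 example. The function names are mine, and the rejection test is simply the "Yup, that belongs to P" check from the body of the question:

```python
import random

P0, PN, DIV, FILL, K, Q = 17010101, 5050505, 99, 2, 4, 20

def draw_candidate():
    r = random.randint(0, (P0 - PN) // DIV)   # 1) draw from the small range
    num = P0 - r * DIV                        # 2) multiply by 99, subtract
    digits = str(num).zfill(FILL * K)         # 3) split per the 0-padding
    return [int(digits[i:i + FILL]) for i in range(0, FILL * K, FILL)]

def belongs_to_P(part):
    # valid iff it sums to Q, is nonincreasing, and has no zero parts
    return (sum(part) == Q and part[-1] >= 1
            and all(a >= b for a, b in zip(part, part[1:])))

part = draw_candidate()
while not belongs_to_P(part):   # rejection loop
    part = draw_candidate()
```

Here the recoded range has about 120,805 values but only 64 of them decode to partitions of 20 into 4 parts, so the loop still rejects most draws; that gap is exactly what a better recoding would close.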

Pros and cons: As stated in the body of the question, this particular recoding method doesn't do enough to greatly improve the chance of randomly selecting an integer representing a member of P. For small numbers of parts (e.g. K < 5) and substantially larger totals (e.g. N > 100), a function implementing this concept can be very fast, because the approach avoids the time-consuming recursion (snake eating its tail) that slows other random partition functions or makes them impractical for large N.

At small K, the probability of drawing a member of P can be reasonable considering how fast the rest of the process is. Coupled with quick random draws, decoding, and evaluation, this function can find uniform random partitions for combinations of N & K (e.g. N = 20000, K = 4) that are untenable with other algorithms. A better way to recode integers is greatly needed to make this a generally powerful approach.

import random
import sys

First, some generally useful and straightforward functions:

def first_partition(N,K):
    part = [N-K+1]
    ones = [1]*(K-1)
    part.extend(ones)
    return part

def last_partition(N,K):
    most_even = [N // K] * K  # integer division (avoids the un-imported floor)
    _remainder = int(N%K)
    j = 0

    while _remainder > 0:
        most_even[j] += 1
        _remainder -= 1
        j += 1
    return most_even

def first_part_nmax(N,K,Nmax):
    part = [Nmax]
    N -= Nmax
    K -= 1
    while N > 0:
        Nmax = min(Nmax,N-K+1)
        part.append(Nmax)
        N -= Nmax
        K -= 1

    return part


#print first_partition(20,4)
#print last_partition(20,4)
#print first_part_nmax(20,4,12)
#sys.exit()

def portion(alist, indices): 
    return [alist[i:j] for i, j in zip([0]+indices, indices+[None])]

def next_restricted_part(part,N,K): # *find next partition matching N&K w/out recursion

    if part == last_partition(N,K):return first_partition(N,K)
    for i in enumerate(reversed(part)):
        if i[1] - part[-1] > 1:
            if i[0] == (K-1):
                return first_part_nmax(N,K,(i[1]-1))

            else:
                parts = portion(part,[K-i[0]-1]) # split p
                h1 = parts[0]
                h2 = parts[1]
                next = first_part_nmax(sum(h2),len(h2),(h2[0]-1))
                return h1+next

""" *I don't know a math software that has this function and Nijenhuis and Wilf (1978) 
 don't give it (i.e. NEXPAR is not restricted by K). Apparently, folks often get the
 next restricted part using recursion, which is unnecessary """

def int_to_list(i): # convert an int to a list w/out padding with 0's
    return [int(x) for x in str(i)]

def int_to_list_fill(i,fill):# convert an int to a list and pad with 0's
    return [x for x in str(i).zfill(fill)]

def list_to_int(l):# convert a list to an integer
    return "".join(str(x) for x in l)

def part_to_int(part,fill):# convert a partition of K parts to an int,
# padding each summand with the respective number of 0's

    p_list = []
    for p in part:
        if len(int_to_list(p)) != fill:
            l = int_to_list_fill(p,fill)
            p = list_to_int(l)
        p_list.append(p)
    _int = list_to_int(p_list)
    return _int       

def int_to_part(num,fill,K): # convert an int to a partition of K parts
    # and pad with the respective number of 0's
    # This function isn't called by the script, but I thought I'd include
    # it anyway because it would be used to recover the respective partition

    _list = int_to_list(num)
    if len(_list) != fill*K:
        ct = fill*K - len(_list) 
        while ct > 0:    
            _list.insert(0,0)
            ct -= 1    
    new_list1 = []
    new_list2 = []

    for i in _list:
        new_list1.append(i)
        if len(new_list1) == fill:
            new_list2.append(new_list1)
            new_list1 = []

    part = []
    for i in new_list2:
        j = int(list_to_int(i))
        part.append(j)

    return part

Finally, we get to the total N and number of parts K. The following will print partitions satisfying N & K in lexical order, with their associated recoded integers:

N = 20
K = 4

print '#,  partition,  coded,    _diff,    smaller_diff'

first_part = first_partition(N,K) # first lexical partition for N&K
fill = len(int_to_list(max(first_part)))
# pad with zeros to 1.) ensure a strictly decreasing relationship w/in P,
# 2.) keep track of (encode/decode) partition summand values

first_num = part_to_int(first_part,fill)
last_part = last_partition(N,K)
last_num = part_to_int(last_part,fill)

print '1',first_part,first_num,'',0,'      ',0

part = list(first_part)
ct = 1
while ct < 10:
    part = next_restricted_part(part,N,K)
    _num = part_to_int(part,fill)

    _diff = int(first_num) - int(_num)
    smaller_diff = (_diff/99)
    ct+=1

    print ct, part, _num,'',_diff,' ',smaller_diff

OUTPUT:

ct, partition, coded, _diff, smaller_diff

1 [17, 1, 1, 1] 17010101 0 0

2 [16, 2, 1, 1] 16020101 990000 10000

3 [15, 3, 1, 1] 15030101 1980000 20000

4 [15, 2, 2, 1] 15020201 1989900 20100

5 [14, 4, 1, 1] 14040101 2970000 30000

6 [14, 3, 2, 1] 14030201 2979900 30100

7 [14, 2, 2, 2] 14020202 2989899 30201

8 [13, 5, 1, 1] 13050101 3960000 40000

9 [13, 4, 2, 1] 13040201 3969900 40100

10 [13, 3, 3, 1] 13030301 3979800 40200

In short, integers in the last column could be a lot smaller.

Why a random sampling strategy based on this idea is fundamentally unbiased:

Each integer partition of N having K parts corresponds to one and only one recoded integer. That is, we don't pick a number at random, decode it, and then try to rearrange the elements to form a proper partition of N&K. Consequently, each integer (whether corresponding to partitions of N&K or not) has the same chance of being drawn. The goal is to inherently reduce the number of integers not corresponding to partitions of N with K parts, and so, to make the process of random sampling faster.
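
For small N and K, the one-to-one claim can be spot-checked by brute force. The enumerator below is my own illustrative code, and recursive, but it exists only for verification, not as part of the method:

```python
def all_partitions(n, k, maxpart=None):
    # every nonincreasing k-tuple of positive integers summing to n
    if maxpart is None:
        maxpart = n
    if k == 1:
        return [[n]] if n <= maxpart else []
    out = []
    for first in range(min(maxpart, n - k + 1), 0, -1):
        for rest in all_partitions(n - first, k - 1, first):
            out.append([first] + rest)
    return out

parts = all_partitions(20, 4)
# encode each partition with 2-digit zero-padded fields, as in the script
codes = [int("".join(str(s).zfill(2) for s in p)) for p in parts]

assert codes[0] == 17010101                  # first lexical partition
assert len(set(codes)) == len(codes)         # one partition <-> one integer
assert all((codes[0] - c) % 99 == 0 for c in codes)  # common divisor 99
```

The last assertion holds because 10^2, 10^4, and 10^6 are all congruent to 1 mod 99, so any two encodings of partitions with equal sums differ by a multiple of 99.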

klocey
  • I'm still a little unclear why you feel like the correct approach is to improve on your concatenation encoding rather than coming up with a better initial encoding. It seems like your previous [solution](http://stackoverflow.com/questions/10287021/an-algorithm-for-randomly-generating-integer-partitions-of-a-particular-length/12742508#12742508) as well as the solution I present here both compute a bijection between partitions of N with K parts and [1..M] where M is the total number of such partitions. You can't get a more compact encoding. – mhum Feb 18 '13 at 21:22
  • I wouldn't say I feel it's the correct approach. It's just a method I used to generate uniform random partitions of N with K parts without recursion that is simplistic, fundamentally unbiased, and unbiased as implemented. Of course, it's insanely impractical when K > 4. Once again, the idea being that recursion is the enemy for some random partitioning problems. – klocey Feb 18 '13 at 21:30
  • Your previous solution and the solution I outline here are basically the same, except I use a slightly different recursion. The problems you had in executing your previous solution (speed, stack smashing) can probably be mitigated if you do what I did and just pre-compute D iteratively. – mhum Feb 18 '13 at 21:41
  • That might work. As far as I could tell, your way of doing it was pretty quick and worked very well. I'll give it a try. Thanks! – klocey Feb 18 '13 at 22:01
  • I'm not sure what you mean here. Are you referring to recursion in a programming sense or in a mathematical sense? While my implementation does not use programmatic recursion, the computation of the number of partitions is still defined recursively (i.e.: B[n][k] = B[n-k][k] + B[n-1][k-1]). While you had problems with (programmatic) recursion smashing your stack in your previous implementation, as I said, you can usually unroll these things. – mhum Feb 18 '13 at 22:20
  • And, if you're referring to programmatic recursion, I would refer you again to Nijenhuis and Wilf where all the code is in Fortran 4 which does not allow recursion at all. So, people have been generating combinatorial objects without recursion for some time now. – mhum Feb 18 '13 at 22:24
  • I mean a function that calls itself to arrive at some initial value and then passes values back up the chain, as in many implemented partition functions. I suppose, I like the idea of getting a uniform random partition w/out recursion and by not looping over possible summand values or multiplicities. The best (only) way I've found to, say, get a uniform random sample of partitions for 400,000 with 4 parts is to treat partitions as a different kind of object (e.g. recoded integer). Maybe implementing your way of getting D into my previous (figshare) algorithm will do it. – klocey Feb 18 '13 at 22:41
  • Deleting the comment you referred to was a knee jerk (i.e. oh crap, I'm totally wrong). I failed to see you made a nice response to it. Apologies. – klocey Feb 18 '13 at 22:58
  • Ok. Fortran didn't get recursion until Fortran 90 (although there were hacky workarounds), so all of the algorithms in Nijenhuis and Wilf are non-recursive. Also, when I look at your [previous implementation](http://stackoverflow.com/questions/10287021/an-algorithm-for-randomly-generating-integer-partitions-of-a-particular-length/12742508#12742508), I think you actually did efficiently code the partitions into integers -- you take an integer from 1 to N (the 'which' variable) and turn it into a partition. I think your problem there is just an inefficient computation of D. – mhum Feb 18 '13 at 23:01
  • Also, you might be able to do a bit better with a slightly different recursive definition. I took mine directly from Nijenhuis and Wilf and it only has two indices. Yours is a little more complicated with 3 indices. But I'm not sure if that will make a big difference. – mhum Feb 18 '13 at 23:07