2

I'm trying to get n random and non-overlapping slices of a sequence where each subsequence is of length l, preferably in the order they appear.

This is the code I have so far and it's gotten more and more messy with each attempt to make it work, needless to say it doesn't work.

def rand_parts(seq, n, l):
    """
    return n random non-overlapping partitions each of length l.
    If n * l > len(seq) raise error.
    """
    if n * l > len(seq):
        raise Exception('length of seq too short for given n, l arguments')
    if not isinstance(seq, list):
        seq = list(seq)
    gaps = [0] * (n + 1)
    for g in xrange(len(seq) - (n * l)):
        gaps[random.randint(0, len(gaps) - 1)] += 1
    result = []
    for i, g in enumerate(gaps):
        x = g + (i * l)
        result.append(seq[x:x+l])
        if i < len(gaps) - 1:
            gaps[i] += x
    return result

For example if we say rand_parts([1, 2, 3, 4, 5, 6], 2, 2) there are 6 possible results that it could return from the following diagram:

[1, 2, 3, 4, 5, 6]
 ____  ____

[1, 2, 3, 4, 5, 6]
 ____     ____ 

[1, 2, 3, 4, 5, 6]
 ____        ____ 

[1, 2, 3, 4, 5, 6]
    ____  ____ 

[1, 2, 3, 4, 5, 6]
    ____     ____ 

[1, 2, 3, 4, 5, 6]
       ____  ____

So [[3, 4], [5, 6]] would be acceptable but [[3, 4], [4, 5]] wouldn't because it's overlapping and [[2, 4], [5, 6]] also wouldn't because [2, 4] isn't contiguous.

I encountered this problem while doing a little code golfing so for interests sake it would also be nice to see both a simple solution and/or an efficient one, not so much interested in my existing code.

Peter O.
  • 32,158
  • 14
  • 82
  • 96
RussW
  • 427
  • 4
  • 11

5 Answers5

8
def rand_parts(seq, n, l):
    indices = xrange(len(seq) - (l - 1) * n)
    result = []
    offset = 0
    for i in sorted(random.sample(indices, n)):
        i += offset
        result.append(seq[i:i+l])
        offset += l - 1
    return result

To understand this, first consider the case l == 1. Then it's basically just returning a random.sample() of the input data in sorted order; in this case the offset variable is always 0.

The case where l > 1 is an extension of the previous case. We use random.sample() to pick up positions, but maintain an offset to shift successive results: in this way, we make sure that they are non-overlapping ranges --- i.e. they start at a distance of at least l of each other, rather than 1.

Armin Rigo
  • 12,048
  • 37
  • 48
  • Very nice. The only thing that can be done faster (but not as elegant) is the sorting which, for this case, can work in O(n) time and O(n) space. But in Python, without real arrays, this point is rather meaningless. – nickie Sep 05 '13 at 19:39
  • This works perfectly and `random.sample` handles the error case I was handling in my code. I guess this might seem simple but I'm having trouble understanding what `len(seq) - (l - 1) * n` and how you determined it. – RussW Sep 05 '13 at 21:01
  • @RussW: how it works it's already stated in the answer... let me try reword it. The case of `L=1` is of course trivial, just do a random sample and you're done. Now think to all the other elements as "extras" and just ignore them for a while and concentrate only on the "heads" of the subsequences. Those "extra" elements are `n*(L-1)` so you take them away, random-chose the "heads" and then compute the result by adding the proper "tails". Don't feel ashamed for not understanding it immediately, it happens very often when talking with Armin... he does his best but he's just too smart :-D – 6502 Sep 06 '13 at 06:43
  • @6502 Thanks. So `(l - 1) * n` represents the sum of the tails, `random.sample` randomly selects the indices for the heads from a range of len(seq) - sum of tails. I'm just trying to visualize it so I can easily arrive at it in future. It is actually quite simple after all. – RussW Sep 06 '13 at 09:00
1

Many solutions can be hacked for this problem, but one has to be careful if the sequences are to be strictly random. For example, it's wrong to begin by picking a random number between 0 and len(seq)-n*l and say that the first sequence will start there, then work recursively.

The problem is equivalent to selecting randomly n+1 integer numbers such that their sum is equal to len(seq)-l*n. (These numbers will be the "gaps" between your sequences.) To solve it, you can see this question.

Community
  • 1
  • 1
nickie
  • 5,608
  • 2
  • 23
  • 37
  • The question you pointed to is not useful in this case: it is about real numbers. – Armin Rigo Sep 05 '13 at 16:34
  • I don't think that rounding will be so much of an issue, so I think it's relevant. – nickie Sep 05 '13 at 16:37
  • `3.4 + 2.4 + 0.2 == 6.0` but `round(3.4) + round(2.4) + round(0.2) == 5.0`, for more or less any definition of rounding. – Armin Rigo Sep 05 '13 at 16:51
  • @ArminRigo, you won't simply round like this, you have to keep track of the fractional parts, but this is not an issue. – nickie Sep 05 '13 at 17:22
  • I think that's what I was trying to do in my code but hadn't defined it as well as you did. – RussW Sep 05 '13 at 20:04
1

This worked for me in Python 3.3.2. It should be backwards compatible with Python 2.7.

from random import randint as r

def greater_than(n, lis, l):
    for element in lis:
        if n < element + l:
            return False
    return True

def rand_parts(seq, n, l):
    """
    return n random non-overlapping partitions each of length l.
    If n * l > len(seq) raise error.
    """
    if n * l > len(seq):
        raise(Exception('length of seq too short for given n, l arguments'))
    if not isinstance(seq, list):
        seq = list(seq)
    # Setup
    left_to_do = n
    tried = []
    result = []
    # The main loop
    while left_to_do > 0:
        while True:
            index = r(0, len(seq) - 1)
            if greater_than(index, tried, l) and index <= len(seq) - left_to_do * l:
                tried.append(index)
                break
        left_to_do -= 1
        result.append(seq[index:index+l])
    # Done
    return result

a = [1, 2, 3, 4, 5, 6]
print(rand_parts(a, 3, 2))

The above code will always print [[1, 2], [3, 4], [5, 6]]

Shashank
  • 13,713
  • 5
  • 37
  • 63
  • This doesn't give equal probabilities. Try for example: `a = [1, 2, 3]; count = {(1,2):0, (1,3):0, (2,3):0}; for i in range(2000): [x], [y] = rand_parts(a, 2, 1); count[x, y] += 1; print count`. It picks the solution `[[2], [3]]` about half the time, and the other two solutions only a quarter of the time each. – Armin Rigo Sep 06 '13 at 08:25
0

If you do it recursively it's much simpler. Take the first part from (so the rest will fit):

 [0:total_len - (numer_of_parts - 1) * (len_of_parts)]

and then recurse with what left to do:

rand_parts(seq - begining _to_end_of_part_you_grabbed, n - 1, l)
Paul Evans
  • 27,315
  • 3
  • 37
  • 54
  • This is precisely the solution that does not generate really random answers. The probability for picking the first one should not be uniform, it should be biased by the number of answers that exist after you pick a number for the first one. – nickie Sep 05 '13 at 17:51
0

First of all, I think you need to clarify what you mean by the term random.

How can you generate a truly random list of sub-sequences when you are placing specific restrictions on the sub-sequences themselves?

As far as I know, the best "randomness" anyone can achieve in this context is generating all lists of sub-sequences that satisfy your criteria, and selecting from the pool however many you need in a random fashion.

Now based on my experience from an algorithms class that I've taken a few years ago, your problem seems to be a typical example which could be solved using a greedy algorithm making these big (but likely?) assumptions about what you were actually asking in the first place:

  • What you actually meant by random is not that a list of sub-sequence should be generated randomly (which is kind of contradictory as I said before), but that any of the solutions that could be produced is just as valid as the rest (e.g. any of the 6 solutions is valid from input [1,2,3,4,5,6] and you don't care which one)
  • Restating the above, you just want any one of the possible solutions that could be generated, and you want an algorithm that can output one of these valid answers.

Assuming the above here is a greedy algorithm which generates one of the possible lists of sub-sequences in linear time (excluding sorting, which is O(n*log(n))):

def subseq(seq, count, length):
    s = sorted(list(set(seq)))

    result = []
    subseq = []

    for n in s:
        if len(subseq) == length:
            result.append(subseq)
            if len(result) == count:
                return result
            subseq = [n]
        elif len(subseq) == 0:
            subseq.append(n)
        elif subseq[-1] + 1 == n:
            subseq.append(n)
        elif subseq[-1] + 1 < n:
            subseq = [n]

    print("Impossible!") 

The gist of the algorithm is as follows:

  • One of your requirements is that there cannot be any overlaps, and this ultimately implies you need to deal with unique numbers and unique numbers only. So I use the set() operation to get rid all the duplicates. Then I sort it.
  • Rest is pretty straight forward imo. I just iterate over the sorted list and form sub-sequences greedily.
  • If the algorithm can't form enough number of sub-sequences then print "Impossible!"

Hope this was what you were looking for.

EDIT: For some reason I wrongly assumed that there couldn't be repeating values in a sub-sequence, this one allows it.

def subseq2(seq, count, length):
    s = sorted(seq)

    result = []
    subseq = []

    for n in s:
        if len(subseq) == length:
            result.append(subseq)
            if len(result) == count:
                return result
            subseq = [n]
        elif len(subseq) == 0:
            subseq.append(n)
        elif subseq[-1] + 1 == n or subseq[-1] == n:
            subseq.append(n)
        elif subseq[-1] + 1 < n:
            subseq = [n]

    print("Impossible!")
Joohwan
  • 2,374
  • 1
  • 19
  • 30
  • I do get what you mean. I paused and thought about that when trying to define the problem but it's not really contradictory to 'randomness'. The problem of returning a random integer i.e. `random.randint(a, b)`, is constrained by result being an integer and the bounds a and b. Similarly, there are p different possible ways to retrieve n sub sequences each of length l that are contiguous and non-overlapping. So the problem is to return any random item from the set of p possibilities. – RussW Sep 05 '13 at 19:58