4

My question is similar to this one, but with the difference that I need an array of zeros and ones as output. I have an original time series of zeroes and ones with high autocorrelation (i.e., the ones are clustered). For some significance-testing I need to create random arrays with the same number of zeroes and ones. I.e. permutations of the original array, however, also the autocorrelation should stay the same/similar to the original so a simple np.permutation does not help me.

Since I'm doing multiple realizations I would need a solution which is as fast as possible. Any help is much appreciated.

Lukas
  • 407
  • 1
  • 4
  • 21
  • I can't think of a very fast solution now. Straightforward seems to identify the number of occurrence per clustersize and then fill a new array with these clusters but at random positions. This would require N iterations where N is the maximum cluster size... – dnalow Sep 15 '16 at 14:41

1 Answers1

2

According to the question to which you refer, you would like to permute x such that

np.corrcoef(x[0: len(x) - 1], x[1: ])[0][1]

doesn't change.

Say the sequence x is composed of

z1 o1 z2 o2 z3 o3 ... zk ok,

where each zi is a sequence of 0s, and each oi is a sequence of 1s. (There are four cases, depending on whether the sequence starts with 0s or 1s, and whether it ends with 0s or 1s, but they're all the same in principle).

Suppose p and q are each permutations of {1, ..., k}, and consider the sequence

zp[1] oq[1] zp[2] oq[2] zp[3] oq[3] ... zp[k] oq[k],

that is, each of the run-length sub-sequences of 0s and 1s have been permuted internally.

For example, suppose the original sequence is

0, 0, 0, 1, 1, 0, 1.

Then

0, 0, 0, 1, 0, 1, 1,

is such a permutation, as well as

0, 1, 1, 0, 0, 0, 1,

and

0, 1, 0, 0, 0, 1, 1.

Performing this permutation will not change the correlation:

  • within each run, the differences are the same
  • the boundaries between the runs are the same as before

Therefore, this gives a way to generate permutations which do not affect the correlation. (Also, see at the end another far simpler and more efficient way which can work in many common cases.)

We start with the function preprocess, which takes the sequence, and returns a tuple starts_with_zero, zeros, ones, indicating, respectively,

  • whether x began with 0
  • The 0 runs
  • The 1 runs

In code, this is

import numpy as np
import itertools

def preprocess(x):
    def find_runs(x, val):
        matches = np.concatenate(([0], np.equal(x, val).view(np.int8), [0]))
        absdiff = np.abs(np.diff(matches))
        ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
        return ranges[:, 1] - ranges[:, 0]

    starts_with_zero = x[0] == 0

    run_lengths_0 = find_runs(x, 0)
    run_lengths_1 = find_runs(x, 1)
    zeros = [np.zeros(l) for l in run_lengths_0]
    ones = [np.ones(l) for l in run_lengths_1]

    return starts_with_zero, zeros, ones

(This function borrows from an answer to this question.)

To use this function, you could do, e.g.,

x = (np.random.uniform(size=10000) > 0.2).astype(int)

starts_with_zero, zeros, ones = preprocess(x)

Now we write a function to permute internally the 0 and 1 runs, and concatenate the results:

def get_next_permutation(starts_with_zero, zeros, ones):
    np.random.shuffle(zeros)
    np.random.shuffle(ones)

    if starts_with_zero:
        all_ = itertools.izip_longest(zeros, ones, fillvalue=np.array([]))
    else:
        all_ = itertools.izip_longest(ones, zeros, fillvalue=np.array([]))
    all_ = [e for p in all_ for e in p]

    x_tag = np.concatenate(all_)

    return x_tag

To generate another permutation (with same correlation), you would use

x_tag = get_next_permutation(starts_with_zero, zeros, ones)

To generate many permutations, you could do:

starts_with_zero, zeros, ones = preprocess(x)

for i in range(<number of permutations needed):
    x_tag = get_next_permutation(starts_with_zero, zeros, ones)

Example

Suppose we run

x = (np.random.uniform(size=10000) > 0.2).astype(int)
print np.corrcoef(x[0: len(x) - 1], x[1: ])[0][1]

starts_with_zero, zeros, ones = preprocess(x)

for i in range(10):
    x_tag = get_next_permutation(starts_with_zero, zeros, ones)

    print x_tag[: 50]
    print np.corrcoef(x_tag[0: len(x_tag) - 1], x_tag[1: ])[0][1]

Then we get:

0.00674330566615
[ 1.  1.  1.  1.  1.  0.  0.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  0.
  1.  1.  0.  1.  1.  1.  1.  0.  1.  1.  0.  0.  1.  0.  1.  1.  1.  1.
  0.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
0.00674330566615
[ 1.  0.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  0.  1.  1.  0.  1.  1.  1.  1.  1.  1.  0.  0.  1.  0.
  1.  1.  1.  1.  0.  0.  0.  1.  1.  1.  1.  1.  1.  1.]
0.00674330566615
[ 1.  1.  1.  1.  1.  0.  0.  1.  1.  1.  0.  0.  0.  0.  1.  0.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  0.  1.  1.
  1.  1.  1.  1.  1.  1.  0.  1.  0.  0.  1.  1.  1.  0.]
0.00674330566615
[ 1.  1.  1.  1.  0.  1.  0.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  0.
  1.  1.  1.  1.  1.  0.  0.  1.  1.  1.  1.  0.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  0.  1.  1.  1.  1.  1.  1.  1.  0.  0.  1.]
0.00674330566615
[ 1.  1.  1.  1.  0.  0.  0.  0.  1.  1.  0.  1.  1.  0.  0.  1.  0.  1.
  1.  1.  0.  1.  0.  1.  1.  0.  1.  1.  1.  1.  1.  1.  1.  0.  0.  1.
  0.  1.  1.  1.  1.  1.  1.  0.  1.  0.  1.  1.  1.  1.]
0.00674330566615
[ 1.  1.  0.  1.  1.  1.  0.  0.  1.  1.  0.  1.  1.  0.  0.  1.  1.  0.
  1.  1.  1.  0.  1.  1.  1.  1.  0.  0.  0.  1.  1.  1.  1.  1.  1.  1.
  0.  1.  1.  1.  1.  0.  1.  1.  0.  1.  0.  0.  1.  1.]
0.00674330566615
[ 1.  1.  0.  0.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  1.  0.  1.  1.  1.  1.  1.
  1.  1.  0.  1.  0.  1.  1.  0.  1.  0.  1.  1.  1.  1.]
0.00674330566615
[ 1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  0.  1.  1.  0.  1.  0.  1.  1.
  1.  1.  1.  0.  1.  0.  1.  1.  0.  1.  1.  1.  0.  1.  1.  1.  1.  0.
  0.  1.  1.  1.  0.  1.  1.  0.  1.  1.  0.  1.  1.  1.]
0.00674330566615
[ 1.  1.  1.  0.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  1.  0.  1.  1.  1.
  0.  1.  1.  1.  1.  1.  1.  0.  1.  1.  0.  1.  1.  1.]
0.00674330566615
[ 1.  1.  0.  1.  1.  1.  1.  0.  1.  1.  1.  1.  1.  1.  0.  1.  0.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  0.  1.  0.  1.  0.  1.  1.  1.  1.  1.  1.  0.]

Note that there is a much simpler solution if

  • your sequence is of length n,

  • some number m has m << n, and

  • m! is much larger than the number of permutations you need.

In this case, simply divide your sequence into m (approximately) equal parts, and permute them randomly. As noted before, only the m - 1 boundaries change in a way that potentially affects the correlations. Since m << n, this is negligible.

For some numbers, say you have a sequence with 10000 elements. It is known that 20! = 2432902008176640000, which is far far more permutations than you need, probably. By dividing your sequence into 20 parts and permuting, you're affecting at most 19 / 10000 with might be small enough. For these sizes, this is the method I'd use.

Community
  • 1
  • 1
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • Nice answer, but one thing worth pointing out is that "permuting sequences" does not give all possible solutions. E.g., for `[0, 1, 1, 1, 0, 1, 0]`, you will not get `[0, 1, 1, 0, 1, 1, 0]`. – Ulrich Stern Sep 19 '16 at 14:14
  • @UlrichStern That's an excellent point - thanks! I tried to find a good tradeoff point between generating a large number of permutations, and fast generation time. This is a very good point, though, and I'll add it to the answer (if it's OK with you). – Ami Tavory Sep 19 '16 at 14:23