36

I have an array that is relatively sparse, and I would like to go through each row and shuffle only the non-zero elements.

Example Input:

[2,3,1,0]
[0,0,2,1]

Example Output:

[2,1,3,0]
[0,0,1,2]

Note how the zeros have not changed position.

To shuffle all elements in each row (including zeros) I can do this:

for i in range(len(X)):
    np.random.shuffle(X[i, :])

What I tried to do then is this:

for i in range(len(X)):
    np.random.shuffle(X[i, np.nonzero(X[i, :])])

But it has no effect. I've noticed that the return type of X[i, np.nonzero(X[i, :])] is different from X[i, :] which might be the cause.

In[30]: X[i, np.nonzero(X[i, :])]
Out[30]: array([[23,  5, 29, 11, 17]])

In[31]: X[i, :]
Out[31]: array([23,  5, 29, 11, 17])
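That difference in return type is indeed the cause: indexing with the result of `np.nonzero` is advanced (fancy) indexing, which returns a copy, so the shuffle only rearranges a temporary array. A minimal sketch using the example input from the question:

```python
import numpy as np

X = np.array([[2, 3, 1, 0],
              [0, 0, 2, 1]])

sub = X[0, np.nonzero(X[0, :])[0]]  # advanced indexing -> a *copy* of the non-zeros
np.random.shuffle(sub)              # shuffles only that copy
print(X[0])                         # X itself is unchanged: [2 3 1 0]
```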
Divakar
Bar

7 Answers

15

You could use the non-inplace numpy.random.permutation with explicit non-zero indexing:

>>> X = np.array([[2,3,1,0], [0,0,2,1]])
>>> for i in range(len(X)):
...     idx = np.nonzero(X[i])
...     X[i][idx] = np.random.permutation(X[i][idx])
... 
>>> X
array([[3, 2, 1, 0],
       [0, 0, 2, 1]])
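This works where the original attempt didn't because `X[i]` uses basic slicing and returns a view, so the fancy-index *assignment* writes back into `X`; only the read on the right-hand side makes a copy. A quick invariant check on the question's input (zero positions and per-row non-zero multisets should be preserved):

```python
import numpy as np

X = np.array([[2, 3, 1, 0],
              [0, 0, 2, 1]])
zero_mask = (X == 0).copy()

for i in range(len(X)):
    idx = np.nonzero(X[i])
    # X[i] is a view, so this fancy-index assignment mutates X in place
    X[i][idx] = np.random.permutation(X[i][idx])

# zeros keep their positions; each row's non-zeros are merely permuted
print(X)
```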
randomir
8

I think I found the three-liner?

i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]
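A quick sanity check of the three-liner on the question's input. The key observation: `i` holds integer row indices, and the added `rand()` values are all below 1, so the argsort keeps rows grouped while randomizing the order inside each row.

```python
import numpy as np

a = np.array([[2, 3, 1, 0],
              [0, 0, 2, 1]])
zero_mask = (a == 0).copy()

i, j = np.nonzero(a.astype(bool))
# argsort of (row index + uniform noise < 1): stable across rows,
# random within each row
k = np.argsort(i + np.random.rand(i.size))
a[i, j] = a[i, j[k]]

print(a)  # zeros unchanged, non-zeros shuffled per row
```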
Daniel F
  • Yup, pretty close! Might be better than what I had in mind in fact. – Divakar Aug 22 '17 at 08:42
  • `rand` is certainly faster than `permute`, but for some reason I can't quite grasp `nonzero` seems to be slightly slower than `where`. – MB-F Aug 22 '17 at 08:52
  • Apparently `np.nonzero(a!=0)` seemed a bit faster than `np.nonzero(a)`. So, I have used that instead in the benchmarking post. – Divakar Aug 22 '17 at 10:01
  • Thanks, running off on holiday so won't be able to tweak it any more than that! – Daniel F Aug 22 '17 at 12:16
  • `a.astype(bool)` seems to be a smidge faster on average (although the improvement seems smaller than the variance in `%timeit` from run to run) – Daniel F Aug 22 '17 at 12:23
  • Thanks @Divakar, that was fun :) Sorry I couldn't get you my timings last night. – Daniel F Aug 29 '17 at 07:27
  • That's okay! You deserved it! It was fun timing those too, a bit tedious, but worth the new knowledge of benchmarking stuff :) – Divakar Aug 29 '17 at 08:15
8

Benchmarking for the vectorized solutions

We are looking to benchmark the vectorized solutions in this post. Vectorization avoids the explicit loop over each row that does the shuffling, so the setup for the input array uses a large number of rows.

Approaches -

def app1(a): # @Daniel F's soln
    i, j = np.nonzero(a.astype(bool))
    k = np.argsort(i + np.random.rand(i.size))
    a[i,j] = a[i,j[k]]
    return a

def app2(x): # @kazemakase's soln
    r, c = np.where(x != 0)
    n = c.size
    perm = np.random.permutation(n)
    i = np.argsort(perm + r * n)
    x[r, c] = x[r, c[i]]
    return x

def app3(a): # @Divakar's soln
    m,n = a.shape
    rand_idx = np.random.rand(m,n).argsort(axis=1)
    b = a[np.arange(m)[:,None], rand_idx]
    a[a!=0] = b[b!=0]
    return a

from scipy.ndimage.measurements import labeled_comprehension
def app4(a): # @FabienP's soln
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1),\
                                func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Benchmarking procedure #1

For a fair benchmark, it seemed reasonable to take the OP's sample and simply stack copies of it as more rows to get a bigger dataset. With that setup we can create two cases: datasets with 2 million and 20 million rows.

Case #1 : Large dataset with 2*1000,000 rows

In [174]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [175]: a = np.vstack([a]*1000000)

In [176]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: %timeit app4(a)
     ...: 
1 loop, best of 3: 264 ms per loop
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 254 ms per loop
1 loop, best of 3: 14.3 s per loop

Case #2 : Larger dataset with 2*10,000,000 rows

In [177]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [178]: a = np.vstack([a]*10000000)

# app4 skipped here as it was slower on the previous smaller dataset
In [179]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: 
1 loop, best of 3: 2.86 s per loop
1 loop, best of 3: 4.62 s per loop
1 loop, best of 3: 2.55 s per loop

Benchmarking procedure #2 : Extensive one

To cover varying percentages of non-zeros in the input array, we run some more extensive benchmarking scenarios. Also, since app4 proved much slower than the others, it is skipped in this section.

Helper function to setup input array :

def in_data(n_col, nnz_ratio):
    # Max. number of elements that my system can handle, i.e. stretching
    # it to its limits. The idea is to use this to decide the number of
    # rows and always use the max. possible dataset size
    num_elem = 10000000

    n_row = num_elem//n_col
    a = np.zeros((n_row, n_col), dtype=int)
    L = int(round(a.size*nnz_ratio))
    a.ravel()[np.random.choice(a.size, L, replace=False)] = np.random.randint(1, 6, L)
    return a

Main timing and plotting script (uses IPython magic functions, so it needs to be run by copying and pasting into an IPython console) -

import matplotlib.pyplot as plt

# Setup input params
nnz_ratios = np.array([0.2, 0.4, 0.6, 0.8])
n_cols = np.array([4, 5, 8, 10, 15, 20, 25, 50])

init_arr1 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr2 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr3 = np.zeros((len(nnz_ratios), len(n_cols) ))

timings = {app1:init_arr1, app2:init_arr2, app3:init_arr3}
for i,nnz_ratio in enumerate(nnz_ratios):
    for j,n_col in enumerate(n_cols):
        a = in_data(n_col, nnz_ratio=nnz_ratio)
        for func in timings:
            res = %timeit -oq func(a)
            timings[func][i,j] = res.best
            print(func.__name__, i, j, res.best)

fig = plt.figure(1)
colors = ['b','k','r']
for i in range(len(nnz_ratios)):
    ax = plt.subplot(2,2,i+1)
    for f,func in enumerate(timings):
        ax.plot(n_cols, 
                [time for time in timings[func][i]], 
                label=str(func.__name__), color=colors[f])
    ax.set_xlabel('No. of cols')
    ax.set_ylabel('time [seconds]')
    ax.grid(which='both')
    ax.legend()
    plt.tight_layout()
    plt.title('Percentage non-zeros : '+str(int(100*nnz_ratios[i])) + '%')
plt.subplots_adjust(wspace=0.2, hspace=0.2)

Timings output -

[Timing plots: runtime vs. number of columns, one panel per percentage of non-zeros]

Observations :

  • Approaches #1 and #2 do an argsort over the non-zero elements of the entire input array. As such, they perform better with a lower percentage of non-zeros.

  • Approach #3 creates random numbers of the same shape as the input array and then gets argsort indices per row. Thus, for a given number of non-zeros in the input, its timings climb more steeply with row length than the first two approaches.

Conclusion :

Approach #1 seems to do pretty well up to the 60% non-zero mark. Beyond that, and if the row lengths are small, approach #3 seems to perform decently.

Divakar
  • Huh. On my machine my algorithm is about 25% faster than yours for the smaller dataset. I'm coming in at `205ms` and yours is about `255ms`. Yours scales better though. – Daniel F Aug 28 '17 at 07:47
  • @DanielF Could you share your timings through pastebin for the two datasets for the four approaches? – Divakar Aug 28 '17 at 08:02
  • Kinda wishing there was a native insertion sort implementation for `numpy.argsort` - that would probably get me over the line on larger datasets, as my index array starts nearly sorted. – Daniel F Aug 28 '17 at 08:02
  • Not from work unfortunately. I can try from my home computer but it will likely be different results. – Daniel F Aug 28 '17 at 08:03
  • @DanielF Or just copy n paste the timing results on here as comments? – Divakar Aug 28 '17 at 08:04
  • Hmm, when I reinitialized the random `a` I jumped back to something close to your timings. Must have been something weird with my first random set. – Daniel F Aug 28 '17 at 08:14
  • It looks like sparsity. If I do `a = np.random.randint(0,8,(1000000,n))` and `a = np.where(a<=3,0,a-3)`, my timings are better than yours. Although to get a random set that sparse with just `randint` would have been a miracle. – Daniel F Aug 28 '17 at 08:16
  • Wait, no, now you're faster again. And kaze's solution beats both of us >.< I give up. Gotta work :) – Daniel F Aug 28 '17 at 08:20
  • @DanielF Added `Timings` section for the timings. Hope the setup looks okay. Would love to see timings from your end, just to confirm those in this post. – Divakar Aug 28 '17 at 09:01
8

As promised, this being the fourth day of the bounty period, here's my attempt at a vectorized solution. The steps involved are explained in some detail below:

  • For easy reference, let's call the input array a. Generate unique indices per row that cover the range of the row length. For this, we can simply generate random numbers of the same shape as the input array and take the argsort indices along each row, which would be those unique indices. This idea has been explored before in this post.

  • Index into each row of the input array with those indices as column indices. Thus, we need advanced indexing here. This gives us an array with each row shuffled. Let's call it b.

  • Since the shuffling is restricted to within each row, if we simply use boolean indexing, b[b!=0], we get the shuffled non-zero elements, restricted to the number of non-zeros per row. This is because the elements of a NumPy array are stored in row-major order, so boolean indexing selects the shuffled non-zero elements of each row before moving on to the next row. Likewise, a[a!=0] yields the non-zero elements of each row, in their original order, before moving on to the next row. So, the final step is simply to grab the masked elements b[b!=0] and assign them into the masked places a[a!=0].

Thus, an implementation covering the three steps mentioned above would be -

m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1) #step1
b = a[np.arange(m)[:,None], rand_idx]          #step2  
a[a!=0] = b[b!=0]                              #step3 

A sample step-by-step run might make things clearer -

In [50]: a # Input array
Out[50]: 
array([[ 8,  5,  0, -4],
       [ 0,  6,  0,  3],
       [ 8,  5,  0, -4]])

In [51]: m,n = a.shape # Store shape information

# Unique indices per row that covers the range for row length
In [52]: rand_idx = np.random.rand(m,n).argsort(axis=1)

In [53]: rand_idx
Out[53]: 
array([[0, 2, 3, 1],
       [1, 0, 3, 2],
       [2, 3, 0, 1]])

# Get corresponding indexed array
In [54]: b = a[np.arange(m)[:,None], rand_idx]

# Do a check on the shuffling being restricted to per row
In [55]: a[a!=0]
Out[55]: array([ 8,  5, -4,  6,  3,  8,  5, -4])

In [56]: b[b!=0]
Out[56]: array([ 8, -4,  5,  6,  3, -4,  8,  5])

# Finally do the assignment based on masking on a and b
In [57]: a[a!=0] = b[b!=0]

In [58]: a # Final verification on desired result
Out[58]: 
array([[ 8, -4,  0,  5],
       [ 0,  6,  0,  3],
       [-4,  8,  0,  5]])
Divakar
4

I came up with this:

nz = a.nonzero()                      # Get nonzero indexes
a[nz] = np.random.permutation(a[nz])  # Shuffle nonzero values with mask

which looks simpler (and maybe a little faster?) than the other proposed solutions.


EDIT: new version that does not mix rows

labels, *idx = nz = a.nonzero()                                    # get masks
a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i])  # permute values
                        for i in np.unique(labels)])               # for each label

where the first array returned by a.nonzero() (the indices of the non-zero values along axis 0) is used as labels. This is the trick that keeps rows from mixing.

Then np.random.permutation is applied on a[a.nonzero()] for each "label" independently.
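A small illustration of the label trick on the question's input, showing the labels array and the per-label permutation step:

```python
import numpy as np

a = np.array([[2, 3, 1, 0],
              [0, 0, 2, 1]])
zero_mask = (a == 0).copy()

labels, cols = nz = a.nonzero()
print(labels)  # row of each non-zero value: [0 0 0 1 1]

# permute the non-zeros of each row (label) independently,
# then write them back through the non-zero mask
a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i])
                        for i in np.unique(labels)])
print(a)
```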

Supposedly scipy.ndimage.measurements.labeled_comprehension could be used here, but it seems to fail with np.random.permutation.

And I finally saw that it looks a lot like what @randomir proposed. Anyway, it was just for the challenge of getting it to work.


EDIT2:

Finally got it working with scipy.ndimage.measurements.labeled_comprehension

def shuffle_rows(a):
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, *idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1), func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Where:

  1. func() shuffles the non zero values
  2. labeled_comprehension applies func() label-wise

This replaces the previous for loop and will be faster on arrays with many rows.

FabienP
  • The latest edit seems to be working! Added timings for it in the community wiki post. I had to make edit at this line though : `labels, idx = nz = a.nonzero()`. Let me know if that was a wrong edit. – Divakar Aug 28 '17 at 04:52
  • @Divakar: Thanks! The `*idx` makes it compatible with n-D arrays, but it is not compatible with all python versions, and this feature was not mentioned in the question. So your edit is OK for me. – FabienP Aug 28 '17 at 08:15
3

This is one possibility for a vectorized solution:

r, c = np.where(x != 0)
n = c.size

perm = np.random.permutation(n)
i = np.argsort(perm + r * n)

x[r, c] = x[r, c[i]]

The challenge in vectorizing this problem is that np.random.permutation gives only flat indices, which would shuffle the array elements across rows. Sorting the permuted values with a per-row offset added makes sure no shuffling across rows occurs.
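A small trace on the question's input shows how the offset keeps rows separate: `perm` values lie in `[0, n)`, and adding `r * n` puts each row's sort keys into a disjoint range, so the argsort groups by row while staying random within a row.

```python
import numpy as np

x = np.array([[2, 3, 1, 0],
              [0, 0, 2, 1]])
zero_mask = (x == 0).copy()

r, c = np.where(x != 0)   # r = [0 0 0 1 1], c = [0 1 2 2 3]
n = c.size                # 5 non-zeros in total

perm = np.random.permutation(n)
# row 0's keys fall in [0, n), row 1's in [n, 2n): disjoint ranges,
# so argsort never interleaves indices from different rows
i = np.argsort(perm + r * n)
x[r, c] = x[r, c[i]]

print(x)  # zeros unchanged, non-zeros shuffled per row
```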

MB-F
3

Here's your two-liner, without needing to install numpy.

from random import random

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    shuffled_nonzero = iter(sorted((i for i in input_list if i != 0), key=lambda k: random()))
    return [i if i == 0 else next(shuffled_nonzero) for i in input_list]

If you don't like one-liners, though, you can make this a generator with

def shuffle_nonzeros(input_list):
    ''' generator that yields the list values with the non-zero values shuffled '''
    random_nonzero_values = iter(sorted((i for i in input_list if i != 0), key=lambda k: random()))
    for i in input_list:
        if i == 0:
            yield i
        else:
            yield next(random_nonzero_values)

or, if you want a list as the output and don't like one-line comprehensions,

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    out = []
    random_nonzero_values = iter(sorted((i for i in input_list if i != 0), key=lambda k: random()))
    for i in input_list:
        if i == 0:
            out.append(i)
        else:
            out.append(next(random_nonzero_values))
    return out
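A runnable check of the list version on the question's rows (restated compactly here so the snippet is self-contained): sorting the non-zeros with a random key shuffles them, while zeros keep their slots.

```python
from random import random

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    nonzeros = iter(sorted((i for i in input_list if i != 0), key=lambda k: random()))
    return [i if i == 0 else next(nonzeros) for i in input_list]

out = shuffle_nonzeros([2, 3, 1, 0])
print(out)  # the trailing zero stays in place
```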
codykochmann