Problems while creating a random numpy vector with unique elements?

Question

I am creating lists like this:

In:

x = [[[random.randint(1,1000), random.randint(0,2)] for i,j in zip(range(1), range(1))][0] for i in range(5)]

Out:

[[76, 0], [128, 2], [194, 2], [851, 2], [123, 1]]

However, what is the most efficient way of making the first element of each sublist unique? It seems that randint doesn't have this option. How can I force 76, 128, 194, 851, 123 to be unique?

Are you interested in a numpy solution? The question is tagged as such, but you also seem to deal with lists and the `random` module. — Reti43, May 26 '21 at 14:59

akilat90 · Accepted Answer · 2021-05-26T18:04:52.863

You can use ~~np.random.choice~~ np.random.default_rng().choice with replace=False to ensure uniqueness.

import numpy as np

first = np.random.choice(np.arange(1000, dtype=int), 5, replace=False)  # replace=False ensures uniqueness
# first = array([645, 543, 233,  93, 420])

second = np.random.choice([1, 2], 5)
# second = array([1, 1, 2, 1, 2])

Combine the two with np.array and take the transpose:

np.array((first, second)).T.tolist()
# [[645, 1], [543, 1], [233, 2], [93, 1], [420, 2]]

Update:

Based on the comment by @Sam Mason and according to this thread, the preferred way since NumPy 1.17 seems to be to use rng = np.random.default_rng().

So, the variable first should be changed to:

rng = np.random.default_rng()
first = rng.choice(np.arange(1000, dtype=int), 5, replace=False)
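
Putting the pieces together with the Generator API, a minimal sketch might look like the following. The function name `unique_pairs` and its parameters are hypothetical (they are not from the original answer), and it assumes the ranges from the question: first values from 1 to 1000 and second values from 0 to 2.

```python
import numpy as np


def unique_pairs(n, low=1, high=1000, seed=None):
    """Return n [first, second] pairs whose first values are all distinct.

    Hypothetical helper: `low`/`high` bound the first value (inclusive),
    and the second value is drawn from {0, 1, 2} as in the question.
    """
    if high - low + 1 < n:
        raise ValueError("Can't ensure uniqueness")
    rng = np.random.default_rng(seed)
    # replace=False samples without replacement, so no value repeats
    first = rng.choice(np.arange(low, high + 1), size=n, replace=False)
    # integers() excludes the upper bound by default, so this yields 0, 1 or 2
    second = rng.integers(0, 3, size=n)
    return np.column_stack((first, second)).tolist()


pairs = unique_pairs(5, seed=0)
```

Passing a `seed` only makes the draw reproducible; leave it as `None` for fresh randomness on every call.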

Timing Comparison

This is a rough timing comparison for a single pair of large values (one array length and one range). For a proper comparison, run it for many combinations of the array length and the range to pick from. Feel free to edit this as new answers appear.

length, max_val = 100000, 10000000


%timeit op(length, max_val)
%timeit akilat90(length, max_val)
%timeit Reti43_np(length, max_val)
%timeit Reti43_p(length, max_val)
%timeit Shivam_Roy(length, max_val)

# 1 loop, best of 5: 392 ms per loop
# 10 loops, best of 5: 45.4 ms per loop
# 1 loop, best of 5: 13.8 s per loop
# 1 loop, best of 5: 261 ms per loop
# 1 loop, best of 5: 364 ms per loop
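
`%timeit` is an IPython magic, so the lines above only run in IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard `timeit` module. The helper name `best_time` and the stand-in function `sample_unique` are hypothetical; substitute any of the functions being compared.

```python
import random
import timeit


def sample_unique(length, max_val):
    # Stand-in for the function under test (hypothetical example)
    return random.sample(range(max_val), length)


def best_time(func, *args, repeat=5, number=1):
    """Return the best per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


elapsed = best_time(sample_unique, 1000, 100000)
print(f"best of 5: {elapsed * 1e3:.3f} ms per loop")
```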

Code to reproduce:

import random

import numpy as np

def op(length, max_val):
    """
    Baseline from the question: first values are drawn from [1, max_val]
    with random.randint, so uniqueness is NOT guaranteed.
    """
    if max_val < length:
        raise ValueError("Can't ensure uniqueness")
    return [[[random.randint(1,max_val), random.randint(0,2)] for i,j in zip(range(1), range(1))][0] for i in range(length)]

def akilat90(length, max_val):
    if max_val < length:
        raise ValueError("Can't ensure uniqueness")
    value_range = np.arange(max_val)
    rng = np.random.default_rng()

    first = rng.choice(value_range, length, replace=False)
    second = rng.choice([1, 2], length)
    return np.array((first, second)).T.tolist()

def Reti43_np(length, max_val):
    if max_val < length:
        raise ValueError("Can't ensure uniqueness")    
    a = np.arange(max_val)[:,None]
    np.random.shuffle(a)
    a = a[:length]
    b = np.random.randint(0, 3, (length, 1))
    out = np.hstack([a, b])
    return out

def Reti43_p(length, max_val):
    if max_val < length:
        raise ValueError("Can't ensure uniqueness")
    a = random.sample(range(1, max_val + 1), length)
    b = [random.randint(0, 2) for _ in range(length)]
    # If you want a list of lists instead `[[first, second] for first, second in zip(a, b)]`
    return list(zip(a, b))    

def Shivam_Roy(length, max_val):
    if max_val < length:
        raise ValueError("Can't ensure uniqueness")
    rand_list = random.sample(range(0, max_val), length)
    return [[[rand_list[x], random.randint(0,2)] for i,j in zip(range(1), range(1))][0] for x in range(length)]

I keep forgetting that `numpy.random.choice` supports sampling with no replacement because I tend to think in parallels of `random.choice` and `random.sample`. — Reti43, May 26 '21 at 15:23
@Reti43 it's quite useful! Also, please check the timing code - I might have made a mistake there. Feel free to edit if that's the case. — akilat90, May 26 '21 at 15:56
@akilat90 your version is so fast because it's ignoring the parameters! also note that using choice from the non-legacy RNG has much better performance than the interface you're using — Sam Mason, May 26 '21 at 16:18
@SamMason Damn! Thanks for pointing out. What is the non-legacy RNG? — akilat90, May 26 '21 at 16:59
@SamMason I think I've figured it out from [this post](https://stackoverflow.com/questions/40914862/why-is-random-sample-faster-than-numpys-random-choice). I was not very active on SO for a while and now it feels like I'm from the stone age! — akilat90, May 26 '21 at 18:09
@akilat90 yup, that looks right. you can also just pass `max_val` directly to it, which makes things another 10 times faster for me — Sam Mason, May 27 '21 at 08:20
just realised that I was testing without `tolist` which now takes the vast majority of the runtime, so you'll only see things improving by 2x — Sam Mason, May 27 '21 at 08:23
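
The two suggestions in the comments above (pass `max_val` directly to `choice`, and skip `tolist` when a NumPy array is acceptable) can be sketched as follows. The function name `unique_pairs_array` is hypothetical, and returning an array instead of a list of lists is an assumption about what the caller can accept; no timings are claimed here.

```python
import numpy as np


def unique_pairs_array(length, max_val, rng=None):
    if max_val < length:
        raise ValueError("Can't ensure uniqueness")
    rng = np.random.default_rng() if rng is None else rng
    # Passing an int samples from 0..max_val-1 without writing np.arange yourself
    first = rng.choice(max_val, size=length, replace=False)
    # Second column: 0, 1 or 2 (upper bound is exclusive)
    second = rng.integers(0, 3, size=length)
    # Returns a (length, 2) array; call .tolist() only if lists are needed
    return np.column_stack((first, second))


out = unique_pairs_array(1000, 10_000, np.random.default_rng(42))
```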

Shivam Roy · Answer 2 · 2021-05-26T15:21:43.890

You can use random.sample to get unique values from a range, like this:

rand_list = random.sample(range(1, 10000), 5)

x = [[[rand_list[x], random.randint(0,2)] for i,j in zip(range(1), range(1))][0] for x in range(5)]

edited May 26 '21 at 15:21

answered May 26 '21 at 14:55

Thanks! However, I got TypeError: Population must be a sequence or set. For dicts, use list(d). – J Do May 26 '21 at 14:58
I'm sorry, I just made an edit; the input parameters have a specific format. Please refer to the documentation as well for more information. – Shivam Roy May 26 '21 at 15:00
Apologies for the trouble, I also realised that `random.sample` returns a `list`. Please refer to the edit. – Shivam Roy May 26 '21 at 15:04
This is a bad application of `random.sample()`. Yes, one call gets k unique elements, but if you call it N times, you don't guarantee there won't be duplicates across calls. If you want to use that function, you need to generate all k elements upfront and then zip them with the 0-2 values for the second column. – Reti43 May 26 '21 at 15:07
@JDo Hi, as Reti43 pointed out, my code was not guaranteeing unique values. I have edited the code to make sure that you always get unique values. I really hope this helps, and I apologise for so many edits, but this one should work. Also, please note that I have used a variable `x` instead of using `i` twice. – Shivam Roy May 26 '21 at 15:23
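
The approach described in the comments (draw all unique first values in one `random.sample` call, then pair each with an independent second value) can also be written without the nested `zip(range(1), range(1))` comprehension. This is a simplified sketch, not the answerer's exact code; the variable names `n` and `firsts` are illustrative.

```python
import random

n = 5
# One call draws n distinct values from 1..9999
firsts = random.sample(range(1, 10000), n)
# Pair each unique first value with a second value in {0, 1, 2}
x = [[first, random.randint(0, 2)] for first in firsts]
```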

Reti43 · Answer 3 · 2021-05-26T15:17:03.463

To get random but unique elements, shuffle your list and take the first N elements.

import numpy as np

rows = 5
a = np.arange(1, 1001)[:,None]
np.random.shuffle(a)
a = a[:rows]
b = np.random.randint(0, 3, (rows, 1))
out = np.hstack([a, b])

Result

array([[  3,   1],
       [291,   1],
       [159,   1],
       [814,   0],
       [989,   2]])
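
With the Generator API mentioned in the accepted answer, the same shuffle-and-slice idea might be sketched as below. This is an adaptation, not part of the original answer; `rng.permutation` returns a shuffled copy rather than shuffling in place, so the slice is taken from that copy.

```python
import numpy as np

rows = 5
rng = np.random.default_rng()
# Shuffle a copy of 1..1000 and keep the first `rows` entries (all distinct)
a = rng.permutation(np.arange(1, 1001))[:rows]
# Second column: integers drawn from 0, 1, 2 (upper bound 3 is exclusive)
b = rng.integers(0, 3, size=rows)
out = np.column_stack((a, b))
```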

For a pure Python solution, you can use random.sample to generate unique elements from a collection.

import random

a = random.sample(range(1, 1001), rows)
b = [random.randint(0, 2) for _ in range(rows)]
# If you want a list of lists instead `[[first, second] for first, second in zip(a, b)]`
out = list(zip(a, b))

@J Do If you don't have a lot of items, you shouldn't be concerned about speed. If that is a requirement you should put it in the question. — Reti43, May 26 '21 at 15:18
