How select random sampling from large dict whitout converting to list in Python

Question

I have a (potentially) huge dict in Python 3.10 and want to randomly sample a few values. Alas, random.sample(my_dict, k) says:

TypeError: Population must be a sequence.  For dicts or sets, use sorted(d).

and random.sample(my_dict.keys(), k) gives

DeprecationWarning: Sampling from a set deprecated
since Python 3.9 and will be removed in a subsequent version.

I don't want to pay the cost of converting my dictionary keys to a list, and I don't need them sorted.

There's an old question in a similar vein, but that's from before stuff got deprecated in Python and the person asking that question didn't mind converting to a list first.

I also tried running random.choice multiple times to simulate random.sample. But that's even worse: it just throws an exception when you use it on a dict. (Instead of giving you a reasonable error message.)

Best I can think of without converting to a list is creating a random list of booleans (or 0/1) and using it as an indicator if to take an element or not. Alternatively, loop over the dict and "flip a coin" on each item until you reach the desired amount. While not creating a list that would still be O(n) as you iterate the whole dict — Tomerikoo, Jul 17 '22 at 09:19
@Tomerikoo The indicator approach works, but is O(n). I'd like something in O(k). (Assuming k << n here and k being fixed.) For context: I want to take a sample to find some approximate quantiles. — Matthias, Jul 17 '22 at 14:41
Have you read the error message at all? You at least aren’t paying for a sort. You need a sequence and MUST make a list! Build it once and save it. — Mark Tolonen, Jul 17 '22 at 15:00
I really don't see how you can avoid O(n) but would be happy to be proven otherwise. If you work on the indicator approach, you still need to iterate the dict because there is no indexed access. If you want to access using keys, you need to sample the keys which means creating a list of them to sample — Tomerikoo, Jul 17 '22 at 18:42
@Tomerikoo https://github.com/matthiasgoergens/python-indexed-access has a proof of concept for how to avoid O(n) runtime for random sampling from a dictionary in Python. It's written as a C extension, because that one leaks a crucial implementation detail that we need that the pure Python interface of dicts doesn't provide. (I've signed up to the Python developers mailing list to see whether I can convince them to let me add machinery to do random sampling from dicts and sets. It's embarrassing that Python doesn't support this properly.) — Matthias, Jul 26 '22 at 01:21

I'mahdi · Answer 1 · 2022-07-17T10:46:18.623

2

You need to use sample(list(dct)). (The example code select random 2 items from original dict.)

from random import sample
dct = {'a':1, 'b':2, 'c':3, 'd':4}

rnd_keys = sample(list(dct), 2)
# rnd_keys -> ['c', 'b']
rnd_dct = dict(sample(list(dct.items()), 2))
print(rnd_dct)

{'c': 3, 'b': 2}

Update without converting huge dict to list (this convert use O(n) space and question say, don't do this.). You can generate random number base len(dict) and use enumerate and only get k,v that idx match with random_idx and break from for-loop when reaching to zero base random_number that we want to select (this break helps you don't see all dict).

from random import sample
dct = {'a':1, 'b':2, 'c':3, 'd':4}
# idx -^0^----^1^----^2^----^3^---
number_rnd = 2
rnd_idx = set(sample(range(len(dct)), number_rnd))
print(rnd_idx)
# {0, 3}
res = {}
for idx, (k,v) in enumerate(dct.items()):
    if idx in rnd_idx:
        res[k] = v
        number_rnd -= 1
        if number_rnd == 0:
            break
print(res)
# {'b': 2, 'c': 3}

Third Approach By thanks Tomerikoo, We can use flip a coin idea, On each iterate over items() we cen generate a random 0 or 1 and if the random number is 1 save the item in the result dict. (Maybe we see all dict items but don't select all random numbers because, maybe we get many random 0.)

import random
dct = {'a':1, 'b':2, 'c':3, 'd':4}

number_rnd = 2
res = {}
for k,v in dct.items():
    rnd_ch = random.getrandbits(1)
    if rnd_ch:
        res[k] = v
        number_rnd -= 1
        if number_rnd == 0:
            break

print(res)

edited Jul 17 '22 at 10:46

answered Jul 17 '22 at 08:31

I'mahdi

23,382
5
22
30

This gives `DeprecationWarning: Sampling from a set deprecated since Python 3.9 and will be removed in a subsequent version.` The `items()` should also be converted to a list – Tomerikoo Jul 17 '22 at 09:02
@Tomerikoo, edited, thanks, what is your python version? I check and don't get warning. – I'mahdi Jul 17 '22 at 09:06
I was running Python 3.9 so if you're using anything lower it makes sense as the warning is about 3.9 – Tomerikoo Jul 17 '22 at 09:07
For the record, running with 3.10 gave the same warning and not an error – Tomerikoo Jul 17 '22 at 09:08
@Tomerikoo, I check on python 3.8.12, give me time to check with newer version ;) – I'mahdi Jul 17 '22 at 09:11
I don't want to pay the O(n) cost to convert my huge dict into a list. I want the sample in about O(sample size) regardless of the size of the dict. – Matthias Jul 17 '22 at 09:13
1

@Matthias, edited answer, see edit part. – I'mahdi Jul 17 '22 at 09:55
This will still be `O(N*k)` where `k` is the number of samples. For each element in the dict you check if it is in `rnd_idx`. A better approach would be to iterate the dict, ["flip a coin"](https://stackoverflow.com/questions/6824681/get-a-random-boolean-in-python) on each element, and `break` when you reach enough elements. – Tomerikoo Jul 17 '22 at 10:00
@Tomerikoo, this `min(O(N,k))` if k len of `dict` we should see all `dict` but if be 0,1,2 we only see three first item of `dict` – I'mahdi Jul 17 '22 at 10:01
No. First of all, `N > k` so this makes no sense I believe. And second imagine if one of the indexes is the last one - you iterate over the whole dict (`O(N)`) and for each element you check if its index is in `rnd_idx` (`O(k)` as `rnd_idx` is a list). That's a total of `O(N*k)` (worst-case) – Tomerikoo Jul 17 '22 at 10:03
@Tomerikoo, I generate random number not generate random keys. please see `update` part. – I'mahdi Jul 17 '22 at 10:03
@Tomerikoo, why do you say `O(N*k)` I only one time iterate over `dict` – I'mahdi Jul 17 '22 at 10:04
@Tomerikoo, and use `enumerate` if idx `items()` exist in `rnd_idx` i save this item or continue – I'mahdi Jul 17 '22 at 10:05
I can see that. Again, you have a list of `k` indexes. You iterate over the dict and check if the element's index is in the list of indexes. That's iterating a list of `k` elements for each of the `N` elements in the dict - `O(N*k)`. Also you are doing `max(rnd_idx)` for each element which is also not efficient – Tomerikoo Jul 17 '22 at 10:05
@Tomerikoo, OK, I understand, what do you say, we need to know if k is constant or not, if `k< – I'mahdi Jul 17 '22 at 10:08
I agree that if `k< – Tomerikoo Jul 17 '22 at 10:10
1

@Tomerikoo, If OP has `len(dct) = 10_000_000` and want select 1_000 random number, the second approach use `O(N) in time` but O(1) in space. but the first approach has `O(N)` in space and time. – I'mahdi Jul 17 '22 at 10:10
@Tomerikoo, Oh thanks, I read about "flip a coin", If I can add to my solution. – I'mahdi Jul 17 '22 at 10:11
Yeah but I just figured out it has a major flaw - when flipping a coin, you could theoratically end up with not enough values... What if all flips were negative? You will iterate the dict and not "collect" enough samples. While this is unlikely to happen (especially if `k< – Tomerikoo Jul 17 '22 at 10:13
@Tomerikoo, Oh thanks, "flip a coin" is a good idea, I add this as third approach – I'mahdi Jul 17 '22 at 10:15
I would just avoid the `max(rnd_idx)` call. You already know how many elements you want (in your example `2`). You can do `if len(res) == 2: break` or even to avoid the `len` call you can maintain a variable called `remaining = 2` and decrease it every time an item is added. Then just do `if remaining == 0: break` – Tomerikoo Jul 17 '22 at 10:15
You can also convert `rnd_idx` to a set for better look-up time (`O(1)`) – Tomerikoo Jul 17 '22 at 10:16
@Tomerikoo, thanks very much for your good advice, but I think random.sample don't generate repeated elements, Am I wrong? – I'mahdi Jul 17 '22 at 10:23
No, but searching in a list is `O(N)` (size of list), while searching in set is always `O(1)`. Last tip, you can actually put the `if number_rnd == 0:` ***inside*** the `if idx in rnd_idx:` condition, after decreasing. This way you check less times and even less operations :) – Tomerikoo Jul 17 '22 at 10:25
@Tomerikoo, thanks, Today I learn from you :) I will add "flip a coin" some minutes later – I'mahdi Jul 17 '22 at 10:26
@Tomerikoo, edited answer and add a third approach, Do I write your name or not in the solution? – I'mahdi Jul 17 '22 at 10:46
Is the goal to sample uniformly (each element having the same probability of being in the sample)? If so, then flipping a coin would not work, unless the order of the elements is uniformly random, but if that's the case you can just take the first k. – Joe Jun 08 '23 at 20:21

score 0 · Answer 2 · answered Jul 17 '22 at 08:35

0

If you’re ok with getting the keys into a list, you can just pick the keys using a random set of integers off of the list. If you don’t want to store them into a list, you can generate your random integers based on the dict size, sort them, then iterate through the dict and sample as you get to indices matching your random integer picks.

answered Jul 17 '22 at 08:35

farimani

1
1

Alas, I am not ok with getting the keys into a list nor sorting them. I want something that runs in about O(sample size) for large dicts. – Matthias Jul 17 '22 at 09:08
O(n) is best you can do unless you store a list of keys as you build the dict. Note that you wouldn’t sort the keys. You’d sort your list of random indices, then iterate the keys and consume the indices in order. – farimani Jul 17 '22 at 16:21
Sorry, I don't understand how O(n) is the best I can do. – Matthias Jul 18 '22 at 23:56

How select random sampling from large dict whitout converting to list in Python

2 Answers2