
Using Python 3.7.3, I need to randomly choose from a weighted list of files in a given directory. Weights are determined by how new the file is and whether or not the user has marked it as a favorite (the newer the file, the more often it is selected).

What is the most efficient way to set the weights? I want the distribution of randomly chosen elements to match the distribution of weights in the list. The favorite flag will be stored in a dictionary with the file names as keys and True/False as values.

Assume the number of items in the weights list must equal the number of elements in filesList, and that the list of weights must collectively add up to 1. Also, this is being run on a Raspberry Pi 3/4.

If another method is better than numpy.random.choice, I'm all for it.

I've looked into "Randomly selecting an element from a weighted list."

import numpy, collections, random

#filesList = os.listdir('./assets')    

filesList = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o'] 

count = len(filesList)

Favorites = {}
for key in filesList:
    Favorites[key] = random.choice([True, False])

weights = [0.01666666666666667,
           0.02666666666666667,
           0.03666666666666667,
           0.04666666666666667,
           0.05666666666666667,
           0.06666666666666667,
           0.06666666666666667,
           0.06666666666666667,
           0.06666666666666667,
           0.06666666666666667,
           0.07666666666666667,
           0.08666666666666667,
           0.09666666666666667,
           0.10666666666666667,
           0.11666666666666667]

# Currently the code for setting the weights is commented out, as I'm not sure how to do it. Above is an example distribution. 

#weights = [0 for x in range(count)]
#for x in range(count):
#    #offset = ?
#    weights[x] = 1/count #+ offset


print(f'Favorites: {Favorites}')
print('weights', weights)    

total = 0     # sum of all weight values must be 1
for weight in weights:
    total += weight

print(f'sum of weights: {total}')

l = [numpy.random.choice(filesList, p=weights) for _ in range(10000)]

print(f'Results: {collections.Counter(l)}')
1 Answer


Since Python 3.6, there is random.choices(), which accepts weights.

The weights don't have to be normalized, meaning they don't have to sum to 1.

import random

choices = random.choices(filesList, weights=weights)
print(choices[0])
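
Note that random.choices() returns a list (of length k, which defaults to 1), so a single pick is choices[0]. As a rough sketch, assuming the filesList and weights from the question, you can also draw many samples in one call and check the resulting distribution the same way the question does:

import collections
import random

# draw 10,000 weighted samples at once; k controls the number of picks
samples = random.choices(filesList, weights=weights, k=10000)
print(f'Results: {collections.Counter(samples)}')

# a single pick is just the first (and only) element of the returned list
single = random.choices(filesList, weights=weights)[0]
print(f'Single pick: {single}')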

EDIT: Now that I realize the question is actually about how to compute the weights, here is a suggestion. For every file, compute a weight like this:

import math

def compute_weight(age_in_days, favorite):
    age_factor = 1       # set to how much you want age to matter
    favorite_factor = 1  # set to how much you want the favorite flag to matter
    if favorite:
        return math.exp(-age_in_days * age_factor) * (favorite_factor + 1)
    else:
        return math.exp(-age_in_days * age_factor)

or something similar. Maybe add the favorite_factor instead of multiplying by it; just play around with it.
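
To tie this back to the directory in the question, here is a minimal sketch, assuming an './assets' directory and a Favorites dictionary keyed by file name (the placeholder values are illustrative): it derives age_in_days from each file's modification time and passes the resulting weights to random.choices().

import math
import os
import random
import time

assets_dir = './assets'             # assumed directory from the question
filesList = os.listdir(assets_dir)

# assumed stand-in: map each file name to True/False as described in the question
Favorites = {name: False for name in filesList}

def compute_weight(age_in_days, favorite):
    age_factor = 1
    favorite_factor = 1
    weight = math.exp(-age_in_days * age_factor)
    return weight * (favorite_factor + 1) if favorite else weight

now = time.time()
weights = []
for name in filesList:
    mtime = os.path.getmtime(os.path.join(assets_dir, name))
    age_in_days = (now - mtime) / 86400    # 86400 seconds per day
    weights.append(compute_weight(age_in_days, Favorites.get(name, False)))

chosen = random.choices(filesList, weights=weights)[0]
print(chosen)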

  • Interesting. I'll give it a try. Just curious, why would it be negative? Is that required? – Jakar510 Jul 02 '19 at 20:56
  • I think you are mixing the exponential function and the normal distribution here. Otherwise I don't have any idea what you are talking about. – Finomnis Jul 02 '19 at 21:04
  • this will work. Thank you. Note, for those that need it, it returns a list, so if you want the result, do random.choices(filesList, weights=weights)[0]. – Jakar510 Jul 02 '19 at 21:08
  • Deleted previous comment, because I started to get salty. Sorry. Again, here the part with the exp: exp(-x) is called exponential decay, exp(x) is exponential growth. Here is a [plot](http://fooplot.com/?lang=de#W3sidHlwZSI6MCwiZXEiOiJleHAoeCkiLCJjb2xvciI6IiNGRjAwMDAifSx7InR5cGUiOjAsImVxIjoiZXhwKC14KSIsImNvbG9yIjoiIzAwRkY0NCJ9LHsidHlwZSI6MTAwMH1d) to illustrate what I mean. – Finomnis Jul 02 '19 at 21:12
  • 1
    If exponential decay is too strong, then you could replace it with ^-1 decay `1/(age_in_days+1)`, or higher polynomials, like `1/((age_in_days+1)^2)` – Finomnis Jul 02 '19 at 21:15
  • So 'random.choices()' seems to behave the same whether the exp is negative or positive. I would think the smaller the number, the less frequently it's chosen; instead, the larger the difference between the max and min, the more often the extremes are chosen. However, the '1/(age_in_days+1)' works best so far for the resulting distribution. And it's not very consistent. – Jakar510 Jul 02 '19 at 22:07
  • Well, the larger 'age_factor', the larger the differences between an old and a new file, and the more often it will choose the new files. If you set 'age_factor' to 0, all files get chosen with the same probability. If you remove the negative inside the exp, then all of a sudden old files will get picked a LOT more often than newer ones. – Finomnis Jul 02 '19 at 22:10
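
To illustrate the decay options discussed in the comments (the sketch referenced above), here is a small, purely illustrative comparison using hypothetical file ages in days, showing how quickly the weights fall off under each decay function:

import math

ages = [0, 1, 2, 5, 10, 30]   # hypothetical file ages in days

exp_weights    = [math.exp(-age) for age in ages]        # exponential decay
inv_weights    = [1 / (age + 1) for age in ages]         # ^-1 decay
inv_sq_weights = [1 / ((age + 1) ** 2) for age in ages]  # steeper polynomial decay

print('exp(-x):  ', [round(w, 4) for w in exp_weights])
print('1/(x+1):  ', [round(w, 4) for w in inv_weights])
print('1/(x+1)^2:', [round(w, 4) for w in inv_sq_weights])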