
From a set of x unique items, I need to repeat each item y times such that y follows a normal distribution.

For example, take n = 5 items and y_max = 50. If we count how many times each item in my sorted list is repeated, the plot would look like this:

[image: graph of the expected repeat counts per item; not preserved]

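import numpy as np

my_set = ('a', 'b', 'c', 'd', 'e')
# One normal draw per item, rounded and used directly as that item's repeat count
distribution = np.random.normal(len(my_set)/2, 1, len(my_set)).round().astype(int)
np.repeat(my_set, distribution)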

I expect the result to follow a trend similar to the graph but instead, the result follows either an increasing or decreasing trend.

For readability, I'll use tuples instead of repeating each item y times.

Expected result should be something like:

[('a', 2), ('b', 4), ('c', 5), ('d', 3), ('e', 1)]

Actual result:

[('a', 5), ('b', 4), ('c', 3), ('d', 4), ('e', 3)]
amanb
ooo
  • There is a [truncated normal distribution](http://en.wikipedia.org/wiki/Truncated_normal_distribution). See [How to get a normal distribution within a range in numpy?](https://stackoverflow.com/q/36894191/5510499) and [How to specify upper and lower limits when using numpy.random.normal](https://stackoverflow.com/q/18441779/5510499). – Vadim Shkaberda Feb 04 '19 at 14:16

1 Answer


Firstly, let us generate the desired result.

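import numpy as np

my_set = ('a', 'b', 'c', 'd', 'e')
# Draw 10000 normally distributed values and round them to integer indices
distribution = np.random.normal(len(my_set)/2, 1, 10000).round().astype(int)
# Clamp each index into the valid range 0..4, then map it to its letter
result = [my_set[max(min(el, 4), 0)] for el in distribution]
np.unique(result, return_counts=True)
# Output from one run (counts vary between runs):
# (array(['a', 'b', 'c', 'd', 'e'], dtype='<U1'),
#  array([ 234, 1377, 3421, 3374, 1594]))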

Here we generate 10000 random values from the given distribution and replace each number with the corresponding letter. The counts are then just what we are looking for: the number of appearances of each letter is normally distributed.
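A seeded variant of the same idea might look like the sketch below. It is not the code above verbatim: the `rng` seed, the use of `np.clip` and `np.bincount`, and centring the mean on the middle index `(len(my_set) - 1) / 2` instead of `len(my_set)/2` are all assumptions made for illustration, chosen so that the counts come out roughly symmetric around 'c'.

```python
import numpy as np

my_set = ('a', 'b', 'c', 'd', 'e')
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Assumption: centre the bell on the middle index (2 for five items)
centre = (len(my_set) - 1) / 2
draws = rng.normal(centre, 1, 10_000).round().astype(int)

# Clamp out-of-range draws into the valid index range 0..len(my_set)-1
indices = np.clip(draws, 0, len(my_set) - 1)

# Count how often each index was drawn; this is the repeat count per item
counts = np.bincount(indices, minlength=len(my_set))
pairs = list(zip(my_set, counts.tolist()))

# The sorted list with each item repeated according to its count
repeated = np.repeat(my_set, counts)
print(pairs)
```

Clamping piles both tails onto 'a' and 'e', so the end counts are slightly inflated; the truncated normal mentioned in the comments avoids that.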

The core problem in your code is a misunderstanding of which value is normally distributed. A call to np.random.normal just generates values of a normally distributed variable. By definition of the normal distribution, a given number x appears with a probability p determined by the normal pdf (after rounding, the probability of the interval that rounds to x). In terms of frequencies, this means that if we generate the variable many times, a fraction p of all trials will be x. And that is just what we are looking for.
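If the randomness is not needed at all, the frequency view suggests computing the counts directly from the pdf. The sketch below is one way to do that; `bell_counts`, its `total` and `sigma` parameters, and the choice to centre on the middle index are hypothetical names and assumptions, not part of the original code.

```python
import numpy as np

def bell_counts(n_items, total, sigma=1.0):
    """Hypothetical helper: split `total` repeats across n_items positions
    in proportion to a normal pdf centred on the middle index."""
    positions = np.arange(n_items)
    centre = (n_items - 1) / 2
    # Unnormalised normal pdf; the constant factor cancels after normalising
    weights = np.exp(-0.5 * ((positions - centre) / sigma) ** 2)
    weights /= weights.sum()
    # Rounding means the counts may not sum exactly to `total` in general
    return np.round(weights * total).astype(int)

my_set = ('a', 'b', 'c', 'd', 'e')
counts = bell_counts(len(my_set), total=50)
pairs = list(zip(my_set, counts.tolist()))
print(pairs)
```

The result is deterministic and symmetric, with the peak on the middle item; a larger `sigma` flattens the bell.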

In your code, what you do instead is make the numbers of occurrences themselves normally distributed. This means that each letter appears n ± s times, where n is the mean and s is a normally distributed error. So every letter gets basically the same count plus normal noise, regardless of its position. Having read your post thoroughly, I do not think that this is what you are looking for.
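To see why this gives no bell shape across the letters, one can repeat the question's draw many times and average the count per position. This is only an illustrative sketch: the seed, the number of trials, and the `avg_per_item` name are assumptions, and the raw (unrounded) draws are averaged for simplicity.

```python
import numpy as np

my_set = ('a', 'b', 'c', 'd', 'e')
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# As in the question: one normal draw per item, with mean len(my_set)/2
mean_count = len(my_set) / 2

# Repeat that experiment 10000 times; each row is one set of five counts
trials = rng.normal(mean_count, 1, size=(10_000, len(my_set)))

# Average count per position: every item hovers around the same mean
avg_per_item = trials.mean(axis=0)
print(dict(zip(my_set, avg_per_item.round(2).tolist())))
```

Every position ends up near the same average, so any rising or falling trend in a single run is just noise.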

sooobus
  • Your `distribution` can contain elements below zero. The way you wrote it, the left tail of the distribution will be added to `result` as "-N" elements of `my_set`. – Vadim Shkaberda Feb 04 '19 at 14:06
  • @sooobus Thank you for the explanation. Your last sentence says that this is not how I would want to do it. Could you provide more details on that? – ooo Feb 04 '19 at 15:57
  • @ooo I still do not understand what the goal is: to generate letters that are normally distributed, or to generate letters such that their frequencies are normally distributed. The first case is what the Python code in the answer is for, and the second case is explained in the last paragraph of the answer. – sooobus Feb 04 '19 at 17:01