I want to calculate the probability of each value in a DataFrame column according to the column's own distribution. For example, my data looks like this:

    data
0      1
1      1
2      2
3      3
4      2
5      2
6      7
7      8
8      3
9      4
10     1

And the output I expect looks like this:

    data       pro
0      1  0.155015
1      1  0.155015
2      2  0.181213
3      3  0.157379
4      2  0.181213
5      2  0.181213
6      7  0.048717
7      8  0.044892
8      3  0.157379
9      4  0.106164
10     1  0.155015

I referred to another question (How to compute the probability ...) and used it to produce the example above. My code is as follows:

import pandas as pd
import scipy.stats
samples = [1,1,2,3,2,2,7,8,3,4,1]
samples = pd.DataFrame(samples,columns=['data'])
print(samples)
kde = scipy.stats.gaussian_kde(samples['data'].tolist())
samples['pro'] = kde.pdf(samples['data'].tolist())
print(samples)

The problem is that this becomes slow when the column is very long. Is there a better way to do it in pandas? Thanks in advance.

giser_yugang

1 Answer

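"Its own distribution" does not have to mean a KDE. If the empirical relative frequency of each value is enough, you can use value_counts with normalize=True and map the result back onto the column (df here is the DataFrame called samples in the question):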

df.assign(pro=df.data.map(df.data.value_counts(normalize=True)))

    data       pro
0      1  0.272727
1      1  0.272727
2      2  0.272727
3      3  0.181818
4      2  0.272727
5      2  0.272727
6      7  0.090909
7      8  0.090909
8      3  0.181818
9      4  0.090909
10     1  0.272727
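
If the KDE values from the question are what is actually needed, one possible way to reduce the cost on a long column with many repeated values is to evaluate the density only once per distinct value and map the results back. The following is a minimal sketch of that idea, not a benchmarked solution; the names uniques and density are introduced here for illustration, and the speed-up assumes the column has far fewer distinct values than rows.

```python
import pandas as pd
import scipy.stats

samples = pd.DataFrame({'data': [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]})

# Fit the KDE on the full column so repeated values keep their weight.
kde = scipy.stats.gaussian_kde(samples['data'].to_numpy())

# Evaluate the density once per distinct value (illustrative names).
uniques = samples['data'].unique()
density = pd.Series(kde.pdf(uniques), index=uniques)

# Map the per-value density back onto every row.
samples['pro'] = samples['data'].map(density)
print(samples)
```

The fit still uses every row, so the result should match calling kde.pdf on the whole column; only the evaluation step is shortened.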
piRSquared
  • First of all, thank you for your answer. Secondly, can I get a probability from the probability density function? If a number is not among the values above, how can I get its probability? For example, how can I get the probability of the value 1.5 based on the distribution of that column? – giser_yugang May 31 '17 at 07:33
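
On the comment's question, a hedged sketch rather than a definitive answer: for a continuous density the probability of any single exact value such as 1.5 is zero, so kde.pdf returns a density, not a probability. A probability only comes from integrating the density over an interval, which scipy's gaussian_kde supports through integrate_box_1d. The interval [1.0, 2.0] below is an arbitrary choice made for illustration.

```python
import scipy.stats

data = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]
kde = scipy.stats.gaussian_kde(data)

# Density at the point 1.5: a height of the curve, not a probability.
density_at_point = kde.pdf([1.5])[0]

# Probability that a value falls in [1.0, 2.0] under the fitted KDE
# (the interval is an assumption chosen for this example).
prob_in_interval = kde.integrate_box_1d(1.0, 2.0)

print(density_at_point, prob_in_interval)
```

A narrower interval around 1.5 gives a smaller probability, so the interval width has to be chosen to suit the data.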