First of all, let me say that I am as new to Python as I am to statistics, so I apologize in advance if my question seems trivial or imprecise. I will do my best to express myself clearly.

I have an empirical dataset for a continuous variable. I have found a convenient piece of code (Data Fitting - El Nino example by @tmthydvnprt) that fits my dataset with different distribution types and returns the best one (the one with the smallest sum of squared errors between the distribution's histogram and the data's histogram).

Now I need to calculate the value that is smaller than or equal to 60% of the data elements. In other words, if I have a dataset vector:

DataSet = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

I want to answer the question: what is the value for which 60% of the elements are equal or larger?

value = 5, since 6 of the 10 values are equal to or greater than 5.

Since the distribution that the code returns may not be normal, I guess that the standard deviation and mean do not really help here. So how do I handle an arbitrary probability distribution function to find the value I am looking for? Should I normalize it somehow, or use the median and quartiles? Or...?

AMaz

1 Answer

Sounds like you're just calculating percentiles, but with a twist. A percentile is the cutoff value below which X% of the population falls. So if you want the value that is smaller than X% of the population, you just take the (100 - X)th percentile. In your case, that is the 40th percentile, with interpolation set to "higher" so that the result is an actual data point rather than a value between two data points. If you want the exact (interpolated) cutoff instead, you can omit that argument.

I would use numpy.percentile to calculate:

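import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# "higher" snaps the result up to the nearest actual data point.
# NumPy >= 1.22 names this keyword `method`; older versions use `interpolation`.
p = np.percentile(a, 40, method="higher")
# The default is linear interpolation between neighbouring data points.
p_exact = np.percentile(a, 40)

print(p)        # prints 5
print(p_exact)  # prints 4.6 (up to floating-point rounding)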
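As a sanity check, here is a minimal sketch that counts what share of the data sits at or above each cutoff. The helper name `share_at_or_above` is my own, not a NumPy function, and the snapping step is just one way to get an actual data point without relying on the `interpolation`/`method` keyword:

```python
import numpy as np

def share_at_or_above(data, cutoff):
    """Fraction of elements in `data` that are >= `cutoff` (hypothetical helper)."""
    data = np.asarray(data)
    return np.mean(data >= cutoff)

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Interpolated 40th percentile (default linear interpolation).
p_exact = np.percentile(a, 40)

# Snap up to the smallest observed value that is >= the interpolated cutoff.
p_snapped = a[a >= p_exact].min()

share_exact = share_at_or_above(a, p_exact)
share_snapped = share_at_or_above(a, p_snapped)

print(p_snapped)      # the snapped cutoff, an element of `a`
print(share_exact)    # share of elements >= interpolated cutoff
print(share_snapped)  # share of elements >= snapped cutoff
```

For this dataset both cutoffs leave 6 of the 10 elements at or above them, which matches the 60% you asked for.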
Scratch'N'Purr