How to Standardize a Column of Data in R and Get a Bell Curve Histogram to Find the Percentage That Falls Within a Range?

Question

I have a data set, and one of its columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column is between 320 and 350 using R. To my understanding, I need to standardize this data and create a bell curve first. I have the mean and standard deviation, but when I compute (X - mean)/SD and plot a histogram of this column, it's still not a bell curve.

This is the code I tried.

myData$C1 <- (myData$C1 - C1_mean) / C1_SD

The proportion you are looking for can be computed by keeping only the values of `C1` within the target range and dividing their count by the total number of rows. `length(myData$C1[myData$C1 >= 320 & myData$C1 <= 350])/nrow(myData)` — Dij, Apr 13 '19 at 22:50

Julius Vainora · Accepted Answer · 2019-04-13T22:25:35.623

If you are simply counting the number of observations in that range, there's no need to do any standardization and you may directly use

mean(myData$C1 >= 320 & myData$C1 <= 350)
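
The one-liner works because `mean()` of a logical vector is the fraction of `TRUE` values. A minimal sketch of that equivalence, assuming a small made-up data frame (the values below are hypothetical, not the OP's data):

```r
# Hypothetical stand-in for the OP's data frame
myData <- data.frame(C1 = c(305, 320, 333, 349, 350, 372, 398, 341))

# Logical vector: TRUE where the value lies in [320, 350]
in_range <- myData$C1 >= 320 & myData$C1 <= 350

p_mean  <- mean(in_range)                # TRUE counts as 1, FALSE as 0
p_count <- sum(in_range) / nrow(myData)  # explicit count divided by total

p_mean
p_count

# If the column may contain NA, decide explicitly whether to drop them
p_no_na <- mean(in_range, na.rm = TRUE)
```

Both forms should give the same number; the `na.rm = TRUE` variant changes the denominator to the non-missing rows only, which may or may not be what is wanted.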

As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape itself of the density function remains the same.

For instance,

x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
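
One way to see that standardizing only relabels the axis is to standardize the cut-offs the same way as the data: the proportion in the range does not change. A small sketch with a made-up vector (hypothetical values chosen so the result is easy to check by hand):

```r
# Hypothetical data; any numeric vector would do
x <- c(300, 310, 320, 335, 350, 360, 400)

m <- mean(x)
s <- sd(x)

z  <- (x - m) / s    # standardized data: mean 0, sd 1
lo <- (320 - m) / s  # standardized lower bound
hi <- (350 - m) / s  # standardized upper bound

p_raw <- mean(x >= 320 & x <= 350)  # proportion on the original scale
p_std <- mean(z >= lo & z <= hi)    # same proportion on the z scale

p_raw
p_std

# The ordering of the observations is untouched as well
same_order <- identical(order(x), order(z))
```

Since the transformation is linear and increasing, every observation keeps its position relative to the bounds, which is why the two histograms above have the same shape.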

I suspect that what you are looking for is an estimate of the true, unobserved proportion. The standardization procedure would then be applicable if you had to use tabulated values of the standard normal distribution function. However, in R we can compute the probability directly, without standardizing or consulting tables. In particular,

pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931

That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from that above since we misspecified the underlying distribution by assuming it to be normal; it actually is a mixture of two normal distributions.
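
For reference, the table-style route (standardize the bounds, then use the standard normal CDF) and the direct `pnorm` call should agree. The sketch below assumes an arbitrary seed, so its figures will differ from those quoted above:

```r
set.seed(1)  # arbitrary seed, an assumption made only for reproducibility
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))

m <- mean(x)
s <- sd(x)

# Direct: normal CDF with the sample mean and standard deviation
p_direct <- pnorm(350, mean = m, sd = s) - pnorm(320, mean = m, sd = s)

# Table-style: standardize the bounds, then use the standard normal CDF
p_table <- pnorm((350 - m) / s) - pnorm((320 - m) / s)

all.equal(p_direct, p_table)

# Empirical proportion, which makes no distributional assumption
p_empirical <- mean(x >= 320 & x <= 350)
```

Before trusting the normal-based figure on real data, a quick visual check such as `qqnorm(x); qqline(x)` can help judge whether the normality assumption is plausible.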

Why are you giving an example that uses a normal approximation for data that is clearly not normal? Just to show how bad it is? — MrFlick, Apr 13 '19 at 22:30
Yes, to show that correctly specifying the true distribution is important, and that the OP is likely making a similar mistake if their data looks nothing like a "bell curve". — Julius Vainora, Apr 13 '19 at 22:35
