I have a data set, and one of its columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column lies between 320 and 350 using R. To my understanding, I need to standardize the data and create a bell curve first. I have the mean and standard deviation, but when I compute (X - mean)/SD and plot a histogram of this column, it's still not a bell curve.

This is the code I tried.

myData$C1 <- (myData$C1 - C1_mean) / C1_SD
  • The proportion you are looking for can be computed by selecting the values of `C1` within the target range and dividing their count by the total number of rows: `length(myData$C1[myData$C1 >= 320 & myData$C1 <= 350])/nrow(myData)` – Dij Apr 13 '19 at 22:50

1 Answer

If you simply want the proportion of observations in that range, there's no need to do any standardization and you may directly use

mean(myData$C1 >= 320 & myData$C1 <= 350)
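
This works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the share of TRUE values. A minimal sketch on a hypothetical toy vector (`v` is made up for illustration and is not the OP's data):

```r
# Hypothetical toy vector, not the OP's data
v <- c(310, 320, 335, 350, 390)

# TRUE where the value lies in [320, 350]: FALSE TRUE TRUE TRUE FALSE
in_range <- v >= 320 & v <= 350

sum(in_range)   # number of values in the range: 3
mean(in_range)  # proportion of values in the range: 0.6
```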

As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape itself of the density function remains the same.

For instance,

x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
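
A related check, sketched under the assumption that the data are simulated as above (the seed and the names `z`, `lower_z`, `upper_z` are arbitrary choices for illustration): if the bounds 320 and 350 are standardized the same way as the data, the empirical proportion should not change, which is another way of seeing that standardization only relabels the axis.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))

# Standardize the data and the two bounds with the same mean and sd
z <- (x - mean(x)) / sd(x)
lower_z <- (320 - mean(x)) / sd(x)
upper_z <- (350 - mean(x)) / sd(x)

p_raw <- mean(x >= 320 & x <= 350)          # proportion on the original scale
p_std <- mean(z >= lower_z & z <= upper_z)  # proportion on the standardized scale

all.equal(p_raw, p_std)  # expected to be TRUE
```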

I suspect that what you are looking for is an estimate of the true, unobserved proportion. Standardization would then be applicable if you had to use tabulated values of the standard normal distribution function. In R, however, pnorm accepts the mean and standard deviation directly, so no standardization is needed. In particular,

pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931

That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from that above since we misspecified the underlying distribution by assuming it to be normal; it actually is a mixture of two normal distributions.
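
For comparison, here is a sketch of the same probability under the correctly specified model. It assumes the simulation setup from the example (an equally weighted mixture of N(300, 20) and N(400, 20)); `p_component` is a helper name introduced here for illustration.

```r
# P(lower <= X <= upper) for a single normal component
p_component <- function(lower, upper, mean, sd) {
  pnorm(upper, mean = mean, sd = sd) - pnorm(lower, mean = mean, sd = sd)
}

# Assumption: two equally weighted components, as in the simulated x
p_mixture <- 0.5 * p_component(320, 350, mean = 300, sd = 20) +
  0.5 * p_component(320, 350, mean = 400, sd = 20)

p_mixture
# roughly 0.079
```

This is much closer to the empirical proportion than the single-normal estimate, which is what one would expect when the assumed distribution matches the one that generated the data.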

Julius Vainora
  • Why are you giving an example that uses a normal approximation for data that is clearly not normal? Just to show how bad it is? – MrFlick Apr 13 '19 at 22:30
  • Yes, to show that correctly specifying the true distribution is important, and that the OP is likely making a similar mistake if his data looks nothing like a "bell curve". – Julius Vainora Apr 13 '19 at 22:35