-1

I am trying to add a probability curve on top of the histogram but it does not seem to work. For example

enter image description here

I wanted to add a probability line on the right side so I could show the density on the left and probability of happening on the right.

The code that I have done

x <- Delays_weather0$dif
h<-hist(x, breaks=10, col="red", xlab="Delays", 
   main="Flight Delays")

and the probability curve that I want to add on

my <- pnorm(-18:265, mean = mean(Delays_weather0$dif), sd = sd(Delays_weather0$dif), lower.tail = FALSE)
plot(my, type = "l")

I hope this is understandable

Jackie
  • 3
  • 3

2 Answers2

1

We don't have access to the Delays_weather0 dataset. Hence, I'll use the 1st 100 observations on dep_delay of the flights dataset, provided in the nycflights13 package.

Since the histogram in R by default plots the frequency, I'll multiply the probabilities by the number of observations, i.e. 1000 to make the two graph comparable.

I'm using the lines function at first.

library(nycflights13)
dataset <- flights$dep_delay[1:1000]
hist(x = dataset,
     breaks = 10,
     col = "red",
     xlab = "Delays", 
     main = "Flight Delays")
range_dataset <- range(dataset,
                       na.rm = TRUE)
equidistant_points_in_range <- seq(from = range_dataset[1],
                                   to = range_dataset[2],
                                   length.out = length(x = dataset))
upper_cdf_probabilities <- pnorm(q = equidistant_points_in_range,
                                 mean = mean(x = dataset,
                                             na.rm = TRUE),
                                 sd = sd(x = dataset,
                                         na.rm = TRUE),
                                 lower.tail = FALSE)
lines(x = length(x = dataset) * upper_cdf_probabilities,
      col = "blue")

Created on 2019-03-17 by the reprex package (v0.2.1)

Another way using the curve function.

dataset <- nycflights13::flights$dep_delay[1:1000]

range_dataset <- range(dataset,
                       na.rm = TRUE)

upper_tail_probability <- function(x)
{
  pnorm(q = x,
        mean = mean(x = dataset,
                    na.rm = TRUE),
        sd = sd(x = dataset,
                na.rm = TRUE),
        lower.tail = FALSE)
}

vectorized_upper_tail_probability <- Vectorize(FUN = upper_tail_probability)

hist(x = dataset,
     freq = FALSE,
     col = "red",
     xlab = "Delays", 
     main = "Flight Delays")

curve(expr = vectorized_upper_tail_probability,
      from = range_dataset[1],
      to = range_dataset[2],
      n = 1000,
      add = TRUE,
      col = "blue")

Created on 2019-03-17 by the reprex package (v0.2.1)

yarnabrina
  • 1,561
  • 1
  • 10
  • 30
0

Following up @yarnabrina's reproducible example to (1) use a kernel density estimator rather than assuming Normality, (2) put a probability axis on the right side:

library(nycflights13)
npts <- 1000
dataset <- flights$dep_delay[1:npts]
par(las=1,bty="l",      ## cosmetic
       mar=c(5,4,2,5))  ## expand R margin to make room for second axis
h0 <- hist(x = dataset,
           breaks=100,
           col = "red",
           xlab = "Delay (min)",
           ylab="",
           main="",
           xlim=c(-50,200))   ## cosmetic: leave out a few extreme values
## put axis label at *top* of axis
mtext(side=2,at=550,"Frequency")
## compute kernel density estimate
dd <- density(na.omit(dataset), adjust=3)
dx <- diff(h0$mids)[1]  ## histogram bin width
## scale density to match count vales
lines(dd$x,dd$y*npts*dx,lwd=2,col="blue")
## set up auxiliary axis
dbrks <- seq(0,0.05,by=0.01)
axis(side=4,at=dbrks*npts*dx,label=dbrks)
mtext(side=4,at=550,"Probability")  ## axis label
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453