How to create bins for data then calculate the ecdf?

Question

I have a dataframe (see below) with 4 pieces per machine and a run time for each piece. I would like to group the run times into 50-hour bins and then calculate the empirical cumulative probability (ECDF) of the run times for each piece and machine.

I tried expanding the grid to get the bins, but I think this replicates the rows too many times and inflates the probabilities.

library(tidyverse)
set.seed(1)
data <- tibble(piece = rep(c("A", "B", "C", "D"), 1000),
               machine = rep(c("Mach1", "Mach2"), times = c(1200, 2800)),
               time = runif(4000, 0, 1000))

I expect the output to look something like this (note that these probabilities will not match the data provided above).

piece   machine     time    prob
A       Mach1       50      .03
A       Mach1       100     .04
A       Mach1       150     .09
A       Mach1       200     .12
...
A       Mach1       1000    1.0
...
B       Mach1       50      .05
B       Mach1       100     .11
B       Mach1       150     .12
B       Mach1       200     .14
...
B       Mach1       1000    1.0
.
.
.
A       Mach2       50      .02
A       Mach2       100     .05
...
B       Mach2       50      .06
B       Mach2       100     .10
...

I would like to use dplyr if possible to retain my pipe structure.

score 0 · Answer 1 · answered Aug 16 '19 at 15:31

Base R's cumsum() is helpful here, applied inside a grouped dplyr pipeline (see also this answer):

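data.ecdf <- data %>%
  # Round each run time up to the upper edge of its 50-hour bin
  mutate(time = ceiling(time / 50) * 50) %>%
  # Count the runs in each piece/machine/bin combination
  group_by(piece, machine, time) %>%
  summarize(num.runs = n()) %>%
  ungroup() %>%
  # Accumulate the per-bin shares within each piece/machine pair
  group_by(piece, machine) %>%
  arrange(machine, piece, time) %>%
  mutate(prob = cumsum(num.runs / sum(num.runs)))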
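
One caveat with the counting approach: a bin that contains no runs produces no row, so the output can skip some time values. If every bin should appear, a possible alternative is to evaluate stats::ecdf() at fixed break points within each group. The following is only a sketch under that assumption; the names breaks and data.ecdf.full are introduced here for illustration, and it rebuilds the example data from the question so that it runs on its own.

```r
library(dplyr)

# Example data from the question
set.seed(1)
data <- tibble(piece = rep(c("A", "B", "C", "D"), 1000),
               machine = rep(c("Mach1", "Mach2"), times = c(1200, 2800)),
               time = runif(4000, 0, 1000))

# Upper edges of the 50-hour bins (illustrative name)
breaks <- seq(50, 1000, by = 50)

data.ecdf.full <- data %>%
  group_by(piece, machine) %>%
  # Within each group, .x holds the non-grouping columns (here: time).
  # ecdf() returns a step function; calling it on breaks gives the share
  # of runs with time <= each bin edge, including edges of empty bins.
  group_modify(~ tibble(time = breaks, prob = ecdf(.x$time)(breaks))) %>%
  ungroup()
```

This returns one row per piece, machine, and bin edge. Where a bin does contain runs, prob should agree with the cumsum() result, because a run is rounded up to a given edge or lower exactly when its time is at or below that edge.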
