0

I have production data, that has 38 different statuses ranging from 0 - 5000, i.e 3000, 3100, 3500. Along with each one of the status, is a status date, model number, and a ship to location.

I need to figure out, on average, how long it takes an item to go from status 3000 to 5000 (delivered), given that it is item M going to destination X. Not just for status 3000 to 4000, but for all 37 different statuses (that are before delivery) so that I can set up a predictive model to forecast future delivery times.

Data Set:

ID Status Status Date "%m/%d/%Y" Model Ship to Location
ABC123 3000 1/1/2023 M X
ABC123 5000 1/5/2023 M X
ABC124 2000 5/10/2022 N Y
ABC124 5000 5/15/2022 N Y

ABC124 2500 12/10/2023 M X

This is the result I am looking for:

Status Model Ship to Location Days needed to reach status 5000
3000 M X 20
3100 M X 15
3300 N Y 10
3400 N Y 5
Haphy_Paphy
  • 63
  • 10
  • Please don't spam tags. The Stack tag-recommendation system is imperfect, it recommends [tag:rstudio] for anything R related, but that tag is reserved for questions specific to the IDE. If you mouse each tag (I have since deleted that one), the hover text says _"do not use this tag for general R programming problems"_. Thanks! – r2evans May 01 '23 at 15:06
  • Should this be done per-`ID`? Or do you not care if a status change is between different IDs? – r2evans May 01 '23 at 15:07
  • FYI, if you want "days to reach status 5000", it would help to have that status in the sample data ... – r2evans May 01 '23 at 15:08
  • Finally, you need to convert your `Status Date` column to a `Date`-class object using `as.Date`. _These_ values are ambiguous, is this format `"%m/%d/%Y"` or `"%d/%m/%Y"`? (If you aren't familiar with `%`-codes, see [`?strptime`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html) for these and many more.) – r2evans May 01 '23 at 15:09
  • There can be a significant variance for the times needed to go from one status to the next on an id level. So for me it is important to know that, on average Model M needs 30 days to go from status 3000 to 5000 if it is being delivered to location X. – Haphy_Paphy May 01 '23 at 15:10
  • Okay, so `ID` is not at all useful here? In your real data, you only have one occurrence of `3000` and only one occurrence of `5000`? From your description (and your use of "variance"), there are multiple things being tracked, which suggests the need for an "ID" field. – r2evans May 01 '23 at 15:11
  • 1
    Ok, will make the adjustments. It is this format: "%m/%d/%Y" – Haphy_Paphy May 01 '23 at 15:11
  • ID is needed to measure the time needed for an individual item to go from one status to the next. Then once all of that data is available, an average should be taken over all items. There is only one occurrence of each status for an item. But the statuses occur multiple times in the report. I.e there are 5000 items in status 3000, and 2000 in status 3300, and then 1000 in status 5000. – Haphy_Paphy May 01 '23 at 15:13

1 Answers1

1

Try this:

library(dplyr)
quux %>%
  mutate(Status.Date = as.Date(Status.Date, format = "%m/%d/%Y")) %>%
  group_by(ID) %>%
  mutate(
    time_to_5000 =
      if (5000 %in% Status) Status.Date[Status==5000][1] - Status.Date else NA
  ) %>%
  ungroup()
# # A tibble: 5 × 6
#   ID     Status Status.Date Model Ship.to.Location time_to_5000
#   <chr>   <dbl> <date>      <chr> <chr>            <drtn>      
# 1 ABC123   3000 2023-01-01  M     X                14 days     
# 2 ABC123   3100 2023-01-10  M     X                 5 days     
# 3 ABC124   2000 2023-11-01  N     Y                NA days     
# 4 ABC124   2500 2023-12-10  N     Y                NA days     
# 5 ABC123   5000 2023-01-15  M     X                 0 days     

Technically we don't need the if statement. That is, this also works:

... %>%
  mutate(time_to_5000 = Status.Date[Status==5000][1] - Status.Date) %>%
  ...

I included it above because I wasn't certain if it made sense to have multiple conditions (e.g., both 3000/5000 must be present) in order to return a non-NA value. There are many ways to effect this logic.


Data (augmented to include at least one 5000):

quux <- structure(list(ID = c("ABC123", "ABC123", "ABC124", "ABC124", "ABC123"), Status = c(3000, 3100, 2000, 2500, 5000), Status.Date = c("1/1/2023", "1/10/2023", "11/1/2023", "12/10/2023", "1/15/2023"), Model = c("M", "M", "N", "N", "M"), Ship.to.Location = c("X", "X", "Y", "Y", "X")), row.names = c(NA, -5L), class = "data.frame")
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Thanks, will let you know how it works! – Haphy_Paphy May 01 '23 at 15:25
  • So this seems to have worked, but as I am very new to R I am struggling to analyze this data. How can I save this as a table or dataframe which I can view and export? Also, I do not need to have the results on an ID basis. The next thing I need to do is average the time needed to go to status 5000 on a model & ship.to.location level. – Haphy_Paphy May 01 '23 at 16:28
  • @Haphy_Paphy to save the restructured data as a new data frame, just use the `<-` operator, something like `new_quux <- quux %>% mutate(....)` - and [this website](https://www.statology.org/r-mean-by-group/) may give you some tips on calculating mean by group (admittedly, was the first site when I googled, "mean by group r") – jpsmith May 01 '23 at 16:41
  • Once you know the time to read 5000 for each ID, then ungroup (as I've done) and then summarize, perhaps something like `%>% filter(Status == 3000) %>% summarize(across(time_to_5000, list(mu = mean, sigma = ~ sd(., na.rm = TRUE))))`. See https://stackoverflow.com/q/11562656/3358272, https://stackoverflow.com/q/1660124/3358272 for more "summarize by group" discussions. – r2evans May 01 '23 at 16:46
  • I appreciate the help! Another quick question, if you don't mind. What is the easiest way to filter out time_to_5000 = NA or time_to_5000 = 0? – Haphy_Paphy May 01 '23 at 22:08
  • `!time_to_5000 %in% c(NA, 0)`, assuming that floating-point comparison is okay (see https://stackoverflow.com/q/9508518/3358272, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f, long discussion there). Alternative, `!is.na(time_to_5000) & time_to_5000 > 0` also works. – r2evans May 01 '23 at 22:30