2

I would like to process data for each row. Let's say if we have collected "mpg" value for two "cyl" for 4 days. I would like to derive minimum mpg value relative to day)

Original Data

** day,cyl,mpg**

  • 1,4,34.4
  • 2,4,21.3
  • 3,4,23.3
  • 4,4,25.0
  • 1,3,23.0
  • 2,3,27.0
  • 3,3,18.3
  • 4,3,17.3

Expected Output

** day,cyl,mpg,min_mpg**

  • 1,4,34.4,34.4
  • 2,4,21.3,21.3
  • 3,4,23.3,21.3
  • 4,4,25.0,21.3
  • 1,3,23.0,23.0
  • 2,3,27.0,23.0
  • 3,3,18.3,18.3
  • 4,3,17.3,17.3

I have given a few thoughts as below...

  1. For Loop processing (which is really not most efficient options)

  2. APPLY and SHIFT function (retain minimum value from the previous row processing in a global variable and reset it to NA for each GROUP. I was unsuccessful to retain minimum mpg value into a global variable)

  3. APPLY and SHIFT function (shift "-1" all the way up to row #1 for each row. Kind of putting a loop in APPLY function. This option might be doing lot more redundant processing)

I tried to use rowShift function as described in the below blog but my requirement is that I need to shift dynamically Use a value from the previous row in an R data.table calculation

Is there any "vectorized" option available? OR Traditional FOR LOOP will be the only option? I prefer option using base R (either data frame or data table)

Community
  • 1
  • 1
R007
  • 101
  • 1
  • 13

1 Answers1

1

We can use cummin

library(dplyr)
df1 %>% 
   group_by(cyl) %>% 
   mutate(min_mpg = cummin(mpg))
# A tibble: 8 x 4
# Groups:   cyl [2]
#    day   cyl   mpg min_mpg
#  <int> <int> <dbl>   <dbl>
#1     1     4  34.4    34.4
#2     2     4  21.3    21.3
#3     3     4  23.3    21.3
#4     4     4  25      21.3
#5     1     3  23      23  
#6     2     3  27      23  
#7     3     3  18.3    18.3
#8     4     3  17.3    17.3

In base R, we can use ave

df1$min_mpg <- with(df1, ave(mpg, cyl, FUN = cummin))

Or using data.table

library(data.table)
setDT(df1)[, min_mpg := cummin(mpg), by = cyl][]

data

df1 <- structure(list(day = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), cyl = c(4L, 
4L, 4L, 4L, 3L, 3L, 3L, 3L), mpg = c(34.4, 21.3, 23.3, 25, 23, 
27, 18.3, 17.3)), class = "data.frame", row.names = c(NA, -8L
))
akrun
  • 874,273
  • 37
  • 540
  • 662