2

I have a table:

ID   Dates        Rates
1  2010-01-01       0
1  2010-01-02       0
1  2010-01-03       2
1  2010-01-04       2
1  2010-01-05       2
1  2010-01-06       1
1  2010-01-07       0
1  2010-01-08       0
1  2010-01-09       0
1  2010-01-10       0
2  2010-01-01       3
2  2010-01-02       3
2  2010-01-03       2

And I want to calculate the third column called "median rates" in Rstudio to show the median value of every 5 consecutive rows and the table should look as below

ID   Dates       Rates   Median_Rates
1    2010-01-01   0        2
1    2010-01-02   0        2
1    2010-01-03   2        2
1    2010-01-04   2        2
1    2010-01-05   2        2
1    2010-01-06   1        0
1    2010-01-07   0        0
1    2010-01-08   0        0
1    2010-01-09   0        0
1    2010-01-10   0        0
2    2010-01-01   3        3
2    2010-01-02   3        3
2    2010-01-03   2        3

And then apply this to all IDs and more than 1 million rows in the dataset?

I want to calculate the median value of Rate for every consecutive 5 rows (e.g. this position +/- 5 rows) by group (ID) and use that as the value of Median_Rates.

NightDog
  • 91
  • 7
  • 1
    I want to calculate the median value of rate for every consecutive 5 rows (e.g. this position +/- 5 rows) and the value used to calculate median rate should be the value from the rate column. @divibisan – NightDog Apr 12 '18 at 18:25
  • 3
    This is called a "rolling median", if you search for that term you should find plenty of examples. In `zoo::rollapply` you can specify `FUN = median`. – Gregor Thomas Apr 12 '18 at 18:25

2 Answers2

1

Function ave is meant for this.
I have borrowed the idea from the accepted answer to this question, changing tapply to ave and sum to median.

data$Median_Rates <- ave(data$Rates, (seq_along(data$Rates)-1) %/% 5, FUN = median)
data
#   ID      Dates Rates Median_Rates
#1   1 2010-01-01     0            2
#2   2 2010-01-02     0            2
#3   3 2010-01-03     2            2
#4   4 2010-01-04     2            2
#5   5 2010-01-05     2            2
#6   5 2010-01-06     1            0
#7   7 2010-01-07     0            0
#8   8 2010-01-08     0            0
#9   9 2010-01-09     0            0
#10 10 2010-01-10     0            0

DATA

data <-
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 5L, 7L, 8L, 9L, 10L
), Dates = structure(1:10, .Label = c("2010-01-01", "2010-01-02", 
"2010-01-03", "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", 
"2010-01-08", "2010-01-09", "2010-01-10"), class = "factor"), 
    Rates = c(0L, 0L, 2L, 2L, 2L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID", 
"Dates", "Rates"), class = "data.frame", row.names = c(NA, -10L
))

EDIT.
With the new dataset, all that is needed is to also include column ID as a grouping variable in the call to ave.
I will call this new dataset data2.

data2$Median_Rates <- ave(data2$Rates, data2$ID, (seq_along(data2$Rates)-1) %/% 5, FUN = median)
data2
#   ID      Dates Rates Median_Rates
#1   1 2010-01-01     0            2
#2   1 2010-01-02     0            2
#3   1 2010-01-03     2            2
#4   1 2010-01-04     2            2
#5   1 2010-01-05     2            2
#6   1 2010-01-06     1            0
#7   1 2010-01-07     0            0
#8   1 2010-01-08     0            0
#9   1 2010-01-09     0            0
#10  1 2010-01-10     0            0
#11  2 2010-01-01     3            3
#12  2 2010-01-02     3            3
#13  2 2010-01-03     2            3

NEW DATA

data2 <-
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L), Dates = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 1L, 2L, 3L), .Label = c("2010-01-01", "2010-01-02", 
"2010-01-03", "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", 
"2010-01-08", "2010-01-09", "2010-01-10"), class = "factor"), 
    Rates = c(0L, 0L, 2L, 2L, 2L, 1L, 0L, 0L, 0L, 0L, 3L, 3L, 
    2L)), .Names = c("ID", "Dates", "Rates"), class = "data.frame", row.names = c(NA, 
-13L))
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • @Y.Yuan You ask for medians "every consecutive 5 rows" (in your comment). My first answer does that. Then you have edited the question with new data. I will see what I can do. – Rui Barradas Apr 12 '18 at 21:16
  • @Y.Yuan Done, see the edit and the new dataset in `dput` format. – Rui Barradas Apr 12 '18 at 21:22
1

A dplyr based solution using lubridate to convert to Date can be achieved as:

library(dplyr)
library(lubridate)

df %>% mutate(Dates = ymd(Dates)) %>%
  group_by(ID) %>%
  arrange(Dates) %>%
  mutate(Group = (row_number()-1) %/% 5 ) %>%
  group_by(ID, Group) %>%
  mutate(Median_Rates = median(Rates)) %>%
  ungroup() %>%
  arrange(ID) %>%
  select(-Group) %>% as.data.frame()

#    ID      Dates Rates Median_Rates
# 1   1 2010-01-01     0            2
# 2   1 2010-01-02     0            2
# 3   1 2010-01-03     2            2
# 4   1 2010-01-04     2            2
# 5   1 2010-01-05     2            2
# 6   1 2010-01-06     1            0
# 7   1 2010-01-07     0            0
# 8   1 2010-01-08     0            0
# 9   1 2010-01-09     0            0
# 10  1 2010-01-10     0            0
# 11  2 2010-01-01     3            3
# 12  2 2010-01-02     3            3
# 13  2 2010-01-03     2            3
MKR
  • 19,739
  • 4
  • 23
  • 33