0

I am facing an issue to compute the median of values in a column. I have hundreds of values in a column but I want to compute the median of five values in a sequence at a time and store the median value in a separate data frame in R and similarly continue until the end of data values available.

The issue is there are some values with #VALUE! and while computing the median if there comes such a value i.e. #VALUE! and there are less than 5 values to take median, then the program should just take the median with whatever number of values are available. Similarly, for the last median value, if there are less than 5 values available to take a median, then it should compute the median with the available number of values.

The link to the .csv file with one column of sample data could be downloaded from here.

LINK to FILE

If someone can help me with this I would be very grateful.

Thank you

  • 1
    The presence of your `#VALUE!` error is not an R thing, I believe that's an Excel-only thing. If you are seeing that, are you using some R/Excel integration? Regardless, "five values in a sequence" sounds like a rolling calculation, easily done using `zoo::rollmedian` (generically `zoo::rollapply`) or one of the `slider`-package equivalent functions (I'm not as familiar). Side note: many people prefer to not click on external links, *and links go stale*, it's helpful to have sample data in the question itself. See https://stackoverflow.com/q/5963269. Thank you! – r2evans Dec 03 '21 at 13:42

2 Answers2

1

You could try this:

  1. Group by a sequence of 5 rows
  2. replace #VALUE! with NA
  3. convert to numeric
  4. summarise with median
Speed %>% 
  group_by(group5 = rep(row_number(), each=5, length.out = n())) %>% 
  mutate(speed = ifelse(speed== "#VALUE!", NA, speed)) %>% 
  type.convert(as.is = TRUE) %>% 
  summarise(median = median(speed, na.rm = TRUE))
  group5 median
    <int>  <dbl>
 1      1   1.32
 2      2  -4.97
 3      3 -13.1 
 4      4 -14.3 
 5      5   6.89
 6      6  -2.97
 7      7 -11.6 
 8      8 -16.0 
 9      9 -18.6 
10     10 -19.9 
# ... with 72 more rows
TarJae
  • 72,363
  • 6
  • 19
  • 66
1

Update for "tumbling window" (top, here) versus "rolling window" (below, kept for posterity/reference). Still using the dat from the top of the rolling-window discussion.

The discussion about #VALUE! (far below) may still be relevant, I'll include the code here.

base R

dat$speed <- suppressWarnings(as.numeric(dat$speed))
aggregate(dat$speed, list(grp = (seq_len(nrow(dat)) - 1) %/% 5), FUN = median, na.rm = TRUE)
#   grp       x
# 1   0  3.4245
# 2   1 -4.9730

dplyr

library(dplyr)
dat %>%
  mutate(speed = suppressWarnings(as.numeric(speed))) %>%
  group_by(grp = (seq_len(n()) - 1) %/% 5) %>%
  summarize(med5 = median(speed, na.rm = TRUE))
# # A tibble: 2 x 2
#     grp  med5
#   <dbl> <dbl>
# 1     0  3.42
# 2     1 -4.97

data.table

library(data.table)
as.data.table(dat)[, speed := suppressWarnings(as.numeric(speed))
  ][, .(med5 = median(speed, na.rm = TRUE)), by = .(grp = (seq_len(nrow(dat)) - 1) %/% 5)][]
#      grp    med5
#    <num>   <num>
# 1:     0  3.4245
# 2:     1 -4.9730

(This below is for rolling-windows, no longer relevant other than the definition of dat data.)


I copied the first 10 rows from that single-column frame, and get

dat <- structure(list(speed = c(0, 5.534, 1.315, 7.6865, -0.479, -0.4605, -4.311, -4.973, -7.69, -11.669)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")

With this, the basic operation is something like this:

newvec <- zoo::rollmedian(dat$speed, 5)
newvec
# [1]  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730

Note that this creates nrow(dat) - (k - 1) (20) values, where k=5 (your window). In general, rolling operations tend to be reductive, but we have options to change that if needed. For instance, we can keep it the same length and pad the ends with NA; to do this, we can "align" the window left, center, or right:

zoo::rollmedian(dat$speed, 5, fill = NA, align = "left")
#  [1]  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730      NA      NA      NA      NA
zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
#  [1]      NA      NA  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730      NA      NA
zoo::rollmedian(dat$speed, 5, fill = NA, align = "right")
#  [1]      NA      NA      NA      NA  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730

Let me show how alignment fits here. With align="center" (the default), we see:

dat$speed
#  [1]   0.0000   5.5340   1.3150   7.6865  -0.4790  -0.4605  -4.3110  -4.9730  -7.6900 -11.6690
###      `----------------------------------------'
###          take the median of these values,
###          and then assign that single value here:
###                                              /
###                             ,---------------'
###                            / 
#  [1]       NA       NA   1.3150   1.3150  -0.4605  -0.4790  -4.3110  -4.9730       NA       NA

dat$speed[1:5]
# [1]  0.0000  5.5340  1.3150  7.6865 -0.4790
median(dat$speed[1:5])
# [1] 1.315

For the next value,

#  [1]   0.0000   5.5340   1.3150   7.6865  -0.4790  -0.4605  -4.3110  -4.9730  -7.6900 -11.6690
###               `----------------------------------------'
###                   take the median of these values,
###                   and then assign that single value here:
###                                                       /
###                                      ,---------------'
###                                     / 
#  [1]       NA       NA   1.3150   1.3150  -0.4605  -0.4790  -4.3110  -4.9730       NA       NA

dat$speed[2:6]
# [1]  5.5340  1.3150  7.6865 -0.4790 -0.4605
median(dat$speed[2:6])
# [1] 1.315

So we can assign that easily to a new frame with data.frame(rollmed = newvec), with or without padding. If you want to assign it back to the original frame, one can do:

dat$rollmed <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
dat
#       speed rollmed
# 1    0.0000      NA
# 2    5.5340      NA
# 3    1.3150  1.3150
# 4    7.6865  1.3150
# 5   -0.4790 -0.4605
# 6   -0.4605 -0.4790
# 7   -4.3110 -4.3110
# 8   -4.9730 -4.9730
# 9   -7.6900      NA
# 10 -11.6690      NA

As for your #VALUE!, that likely manifests as a character column instead of numeric, in which case you have an extra step before all of the above.

I'll change one of my values to that error:

dat$speed[5] <- "#VALUE!"
dat
#      speed
# 1        0
# 2    5.534
# 3    1.315
# 4   7.6865
# 5  #VALUE!
# 6  -0.4605
# 7   -4.311
# 8   -4.973
# 9    -7.69
# 10 -11.669
str(dat)
# 'data.frame': 10 obs. of  1 variable:
#  $ speed: chr  "0" "5.534" "1.315" "7.6865" ...

(See? chr.)

From here, we can simply convert all to numbers, ignoring the warnings we get:

dat$speed <- suppressWarnings(as.numeric(dat$speed))
dat
#       speed
# 1    0.0000
# 2    5.5340
# 3    1.3150
# 4    7.6865
# 5        NA
# 6   -0.4605
# 7   -4.3110
# 8   -4.9730
# 9   -7.6900
# 10 -11.6690
str(dat)
# 'data.frame': 10 obs. of  1 variable:
#  $ speed: num  0 5.53 1.31 7.69 NA ...

And from here, we can do the rolling median again. Note that the NA we now have changes things slightly:

dat$rollmed <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
dat$rollmed2 <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center", na.rm = TRUE)
dat
#       speed rollmed rollmed2
# 1    0.0000      NA       NA
# 2    5.5340      NA       NA
# 3    1.3150      NA  3.42450
# 4    7.6865      NA  3.42450
# 5        NA      NA  0.42725
# 6   -0.4605      NA -2.38575
# 7   -4.3110      NA -4.64200
# 8   -4.9730  -4.973 -4.97300
# 9   -7.6900      NA       NA
# 10 -11.6690      NA       NA

The default (what we did before) will return an NA median for every row within +/- 4 rows (k-1) of the previous #VALUE!; we can add na.rm=TRUE if you would prefer; that is not a rolling-window thing, that's a general-statistics question, "are empty values an issue".

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks, @r2evans for the answer. Actually, I do not want to do a rolling average like the answer you got at the start with basic operation. Rather, I want to do an average of 5 rows and then the average of the next 5 rows. How can I modify the code to give me that? – Muhammad Usama Ashraf Dec 03 '21 at 14:31
  • Yes, I misunderstood that. You need to clarify your question. BTW, I often find tumbling windows (that's what it's called) are usually "aligned" with something, such as 1Hz on 5Hz data, or weekly on weekday data, etc. Occasionally a tumbling window in the absence of "time" or another variable is certainly useful, but is there any "time" component in your data? – r2evans Dec 03 '21 at 14:34
  • No, my data has no "time." But thank you anyways. You helped me understand many other things while answering the main question. Much appreciated. – Muhammad Usama Ashraf Dec 03 '21 at 14:38