Data Processing in R

Question

If I have data that looks like the following:

My objective is to take the maximum value within each neighborhood of 0.1. For example, since 1.11 and 1.21 are within 0.1 of each other, I would like to take the max of both columns (which would be 4.2*10^-3 for the row corresponding to 1).

To find the max, I know from this post that I could use the pmax function. However, I am not sure how to find all columns in 0.1 "neighborhoods" or how to create a new matrix that removes the redundant columns.

What would be the desired outcome if you had multiple columns within 0.1 of each other (e.g. 1.11, 1.21, 1.31, 1.41)? Would you want to group all of them together and find the maximum across all those columns? — LucyMLi, Jan 22 '18 at 19:28
What will be the label for each maximum? Do you want to get a result for each column, or a result for each group of "close enough" columns? If the latter, how would you label a group of columns? By the first column in the group? — Isaac, Jan 22 '18 at 20:59
Yes, by the first column in the group or average of the column names. — user2657817, Jan 22 '18 at 21:18

score 1 · Answer 1 · answered Jan 22 '18 at 23:02

It's a bit ugly and there may well be a more idiomatic way to do it, but I think this does what you want.

groupMaxes <- function(data, dif=0.1) {
    cols <- as.numeric(names(data)) # Column names as numbers
    data <- data[, order(cols), drop = FALSE] # Order the data by column names
    cols <- sort(cols) # Keep the numeric names in the same order as the columns
    maxes <- NULL # Structure to store the max columns we compute
    ncol <- length(cols) # Number of columns in data set
    i <- 1
    while (i <= ncol) {
        curcol <- cols[[i]]
        mx <- data[,i]
        while (i < ncol && cols[[i + 1]] - cols[[i]] < dif) {
            i <- i + 1
            mx <- pmax(mx, data[,i])
        }
        newcol <- data.frame(mx)
        names(newcol) <- curcol
        if (is.null(maxes))
            maxes <- newcol
        else
            maxes <- cbind(maxes, newcol)
        i <- i + 1
    }
    maxes
}

Example:

> a
  1.11 1.21 1.32
1    9    4    1
2    0    0    1
3    0    0    1
4    0    3    1
5    0    0    1
> groupMaxes(a)
  1.11 1.32
1    9    1
2    0    1
3    0    1
4    3    1
5    0    1
> groupMaxes(a, .2)
  1.11
1    9
2    1
3    1
4    3
5    1
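
For anyone who wants to reproduce the example, here is a minimal sketch that rebuilds `a` and applies the same chaining rule with `diff`/`cumsum` instead of an explicit loop. The helper name `group_maxes_vec` is made up for illustration and is not part of the answer above. It assumes every column name parses as a number.

```r
# Rebuild the example data frame; check.names = FALSE keeps the numeric-looking names.
a <- data.frame(
  "1.11" = c(9, 0, 0, 0, 0),
  "1.21" = c(4, 0, 0, 3, 0),
  "1.32" = c(1, 1, 1, 1, 1),
  check.names = FALSE
)

# Hypothetical helper: same grouping rule as groupMaxes (a column joins the
# current group when it is less than `dif` above the previous column).
group_maxes_vec <- function(data, dif = 0.1) {
  cols <- as.numeric(names(data))
  ord <- order(cols)
  cols <- cols[ord]
  data <- data[, ord, drop = FALSE]
  # A new group starts at the first column and wherever the gap is >= dif.
  grp <- cumsum(c(TRUE, diff(cols) >= dif))
  out <- lapply(split(seq_along(cols), grp), function(idx) {
    # Row-wise maximum across all columns of the group.
    do.call(pmax, unname(as.list(data[, idx, drop = FALSE])))
  })
  out <- data.frame(out, check.names = FALSE)
  names(out) <- cols[!duplicated(grp)] # Label each group by its first column
  out
}

res <- group_maxes_vec(a, 0.2) # all three columns chain into one group
```

Because the comparison is done on floating-point differences, a gap that is nominally equal to `dif` (such as 1.21 - 1.11 with `dif = 0.1`) can land on either side of the threshold, so it may be worth picking a threshold that does not coincide with the actual spacing.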

score 1 · Answer 2 · answered Jan 24 '18 at 22:52

First, a function to group the column names that are within 0.1 of each other:

group_vector <- function (vec, threshold=0.1) {
  vec <- sort(vec)
  groups <- as.list(seq_along(vec))
  ngroups <- 1
  # seq_along(vec)[-1] is empty for a single-element vector, unlike 2:length(vec)
  for (i in seq_along(vec)[-1]) {
    if ((vec[i]-vec[i-1])<=threshold) {
      groups[[ngroups]] <- c(groups[[ngroups]], i)
    } else {
      ngroups <- ngroups + 1
      groups[[ngroups]] <- i
    }
  }
  groups[1:ngroups]
}
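
To see what this kind of grouping produces, here is a small sketch on made-up column positions (the values and variable names below are hypothetical, chosen so that no gap sits exactly on the threshold). It uses `diff` and `cumsum` as a compact stand-in for the loop, which should give the same groups under the same "gap to the previous value" rule.

```r
# Hypothetical column positions, deliberately unsorted.
positions <- c(1.30, 0.70, 0.82, 0.75, 1.36)
threshold <- 0.1

sorted <- sort(positions)
# TRUE marks the first element of each group: the first value, and any value
# more than `threshold` above its predecessor.
starts <- c(TRUE, diff(sorted) > threshold)
# Each list element holds the indices (into the sorted vector) of one group.
groups <- unname(split(seq_along(sorted), cumsum(starts)))

# Label each group by the mean of its members.
labels <- sapply(groups, function(idx) mean(sorted[idx]))
```

Note that the grouping chains: 0.70 and 0.82 are more than 0.1 apart, yet they end up in the same group because 0.75 lies between them. That matches grouping a run such as 1.11, 1.21, 1.31, 1.41 together, as raised in the comments on the question.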

Then, a function that calculates the maximum value for each row within each group of columns, and names each new column with the average of the grouped column names:

group_data_max <- function (original_data, threshold) {
  vec <- as.numeric(names(original_data))
  original_data <- original_data[, order(vec), drop = FALSE]
  groups <- group_vector(vec=vec, threshold=threshold)
  new.data <- data.frame(lapply(groups, function (x) {
    apply(data.frame(original_data[, x]), 1, max)
  }))
  names(new.data) <- sapply(groups, function (x) mean(sort(vec)[x]))
  new.data
}

Testing on an example data set:

set.seed(1000)
example.data <- data.frame(replicate(10, sapply(rnorm(1000, 0, 1e-3), max, 0))) # 10 columns, matching the 10 names below
names(example.data) <- round(runif(10, 0, 2), 3)
head(example.data)
         1.656        0.894        0.708        1.307        0.818        1.899
1 0.000000e+00 0.0020804209 1.222081e-03 0.0000000000 0.0006729516 0.0022213225
2 0.000000e+00 0.0000000000 0.000000e+00 0.0000000000 0.0000000000 0.0000000000
3 4.112631e-05 0.0008626092 0.000000e+00 0.0006871080 0.0003988231 0.0015567983
4 6.393884e-04 0.0006410248 9.315820e-04 0.0001706286 0.0000000000 0.0000000000
5 0.000000e+00 0.0000000000 9.995761e-05 0.0000000000 0.0006471052 0.0005108526
6 0.000000e+00 0.0000000000 0.000000e+00 0.0013138954 0.0012562174 0.0005090567
         0.994        1.641        1.751        1.138
1 0.0003542045 0.0000000000 0.000000e+00 0.0006481930
2 0.0003942478 0.0000000000 0.000000e+00 0.0015370211
3 0.0013130688 0.0000000000 8.991744e-04 0.0005104541
4 0.0000000000 0.0001117057 1.011685e-03 0.0002280315
5 0.0001000137 0.0000000000 1.733699e-05 0.0000000000
6 0.0000000000 0.0020320953 3.266437e-04 0.0011959593

Result:

group_data_max(example.data, threshold=0.1)

         0.708        0.902        1.138        1.307 1.68266666666667
1 1.222081e-03 0.0020804209 0.0006481930 0.0000000000     0.000000e+00
2 0.000000e+00 0.0003942478 0.0015370211 0.0000000000     0.000000e+00
3 0.000000e+00 0.0013130688 0.0005104541 0.0006871080     8.991744e-04
4 9.315820e-04 0.0006410248 0.0002280315 0.0001706286     1.011685e-03
5 9.995761e-05 0.0006471052 0.0000000000 0.0000000000     1.733699e-05
6 0.000000e+00 0.0012562174 0.0011959593 0.0013138954     2.032095e-03
         1.899
1 0.0022213225
2 0.0000000000
3 0.0015567983
4 0.0000000000
5 0.0005108526
6 0.0005090567
