
I am trying to find the minimum of one column of a matrix within groups defined by the values of another column. Suppose my matrix looks like this:

A    B
3    1.2
3    3.4
3    0.7
4    4.5
4    4.9
5    0.1
5    0.4
6    5

I wish to find the "local minimum" of B for each group of rows sharing the same value in A.

So, I would like to get a new variable C, that looks like:

A    B     C
3    1.2   0.7
3    3.4   0.7
3    0.7   0.7
4    4.5   4.5
4    4.9   4.5
5    0.1   0.1
5    0.4   0.1
6    5     5

As one can see, each value in C is the minimum of the entries of B sharing the same value in A.

C <- numeric(length(A))
for (i in seq_along(A)) {
  # minimum of B over all rows whose A equals the current row's A
  C[i] <- min(B[A == A[i]])
}

A for loop like the one above might do the trick, but it gets very computationally intensive and expensive on large data. I was wondering if there was a simpler way to create a column with these minimum values. Thanks!

user1398057

4 Answers


There are two ways to approach this. The easiest is:

> data$C <- ave(data$B, data$A, FUN = min)
> data
  A   B   C
1 3 1.2 0.7
2 3 3.4 0.7
3 3 0.7 0.7
4 4 4.5 4.5
5 4 4.9 4.5
6 5 0.1 0.1
7 5 0.4 0.1
8 6 5.0 5.0

The other is the "split-apply-combine" strategy: split B by the values of A, apply min to each piece, and combine the per-group results back onto the original rows.
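A minimal base-R sketch of that strategy, assuming the data frame is called data as above:

group_min <- tapply(data$B, data$A, min)    # one minimum per value of A
data$C <- group_min[as.character(data$A)]   # map each group's minimum back to its rows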

Thomas

Thomas's answer is good. This is also pretty easy to do in data.table.

First, generate the data and load the package:

a <- c(rep(1,10),rep(2,10),rep(3,10))
b <- rnorm(30,10,10)
test <- data.frame(a,b)
library(data.table)
setDT(test)

Then create the new column:

test[,c:=min(b),by=a]

The results:

     a            b          c
  1: 1 14.844139174 -5.7023026
  2: 1 23.415562328 -5.7023026
  3: 1 11.235909827 -5.7023026
  4: 1  3.157023269 -5.7023026
  5: 1 16.998426190 -5.7023026
  6: 1  2.610968522 -5.7023026
  7: 1 12.978947375 -5.7023026
  8: 1 -5.702302571 -5.7023026
  9: 1 15.497820331 -5.7023026
 10: 1 25.732043280 -5.7023026
 11: 2  4.141440090 -0.2139922
 12: 2 16.249885918 -0.2139922
 13: 2  4.644226740 -0.2139922
 14: 2 19.529676583 -0.2139922
 15: 2 -0.213992236 -0.2139922
 16: 2 24.080005586 -0.2139922
 17: 2  5.670197636 -0.2139922
 18: 2 28.791961411 -0.2139922
 19: 2  5.514285666 -0.2139922
 20: 2 14.711643377 -0.2139922
 21: 3  3.933511784 -0.1845652
 22: 3  2.159187044 -0.1845652
 23: 3 21.925758264 -0.1845652
 24: 3 15.572109856 -0.1845652
 25: 3  0.001924788 -0.1845652
 26: 3 20.288486370 -0.1845652
 27: 3 16.568929808 -0.1845652
 28: 3  2.180839713 -0.1845652
 29: 3 -0.184565166 -0.1845652
 30: 3  1.236794095 -0.1845652

In the comments there is a brief discussion about setDT() versus test <- data.table(test). setDT() converts the data frame in place and does not create a copy, so it is more memory efficient than the data.table() call I originally had in this post. Thanks to David Arenburg for the information.
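For illustration, a small sketch of the difference (df1 and dt1 are placeholder names):

library(data.table)
df1 <- data.frame(x = 1:5)
dt1 <- data.table(df1)   # allocates a new copy of df1's columns
setDT(df1)               # converts df1 to a data.table in place, no copy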

bjoseph

I have just compared the time needed to execute the ave and data.table commands on a big dataset (10 million rows). My computer's processor: Intel Core i3 M350, 2.27 GHz.

a <- gl(1000000,10)
b <- rnorm(length(a))
test <- data.frame(a,b)
library(data.table)
test <- data.table(test)


Rprof("Test")
l <- test[,c:=min(b),by=a]
Rprof(NULL)
head(summaryRprof("Test")$by.total, 5)
# --------------------------------------------------------
                   total.time total.pct self.time self.pct
".Call"              1.30    100.00      0.92    70.77
"["                  1.30    100.00      0.00     0.00
"[.data.table"       1.30    100.00      0.00     0.00
"min"                0.38     29.23      0.38    29.23
"forder"             0.16     12.31      0.00     0.00
# --------------------------------------------------------

Rprof("Test1")
test$c <- ave(test$b, test$a, FUN = min)
Rprof(NULL)
head(summaryRprof("Test1")$by.total, 5)
 # --------------------------------------------------------
                  total.time total.pct self.time self.pct
"ave"                   7.16     89.72      0.04     0.50
"split<-.default"       3.74     46.87      2.36    29.57
"split<-"               3.74     46.87      0.00     0.00
"lapply"                2.68     33.58      0.84    10.53
"split"                 1.50     18.80      0.00     0.00
"split.default"         1.48     18.55      1.48    18.55
"FUN"                   1.28     16.04      1.28    16.04
".Call"                 0.82     10.28      0.82    10.28
"$<-"                   0.82     10.28      0.00     0.00
"$<-.data.table"        0.82     10.28      0.00     0.00
 # --------------------------------------------------------

Conclusion: on this data, data.table is roughly 5.5 times faster than ave (1.30 s vs. 7.16 s total time).
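For a rough comparison without the profiler, system.time() on the same objects gives a similar picture (a sketch; absolute times will vary with hardware):

system.time(test[, c := min(b), by = a])
system.time(test$c2 <- ave(test$b, test$a, FUN = min))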

And_R
    Great! `data.table 1.9.3` has optimised `min` calls within [GForce](https://github.com/Rdatatable/data.table/blob/master/README.md), which should result in even better timings... in case you're interested to test it out. Even better on even larger data :). – Arun Jul 29 '14 at 21:23

A dplyr solution:

library(dplyr)
df %>% 
  group_by(A) %>%
    mutate(C = min(B))

# Source: local data frame [8 x 3]
# Groups: A
# 
#   A   B   C
# 1 3 1.2 0.7
# 2 3 3.4 0.7
# 3 3 0.7 0.7
# 4 4 4.5 4.5
# 5 4 4.9 4.5
# 6 5 0.1 0.1
# 7 5 0.4 0.1
# 8 6 5.0 5.0
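
If a single row per group is enough rather than repeating the minimum on every row, summarise() can be used in place of mutate(); a small sketch with the same df:

df %>%
  group_by(A) %>%
  summarise(C = min(B))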
David Arenburg