2

In R (which I am relatively new to) I have a data frame consists of many column and a numeric column I need to aggregate according to groups determined by another column.

 SessionID   Price
 '1',       '624.99'
 '1',       '697.99'
 '1',       '649.00'
 '7',       '779.00'
 '7',       '710.00'
 '7',       '2679.50'

I need to group by SessionID and return the Max and Min for each ONTO the original data frame e.g. :

 SessionID   Price     Min     Max
 '1',       '624.99'   624.99  697.99
 '1',       '697.99'   624.99  697.99
 '1',       '649.00'   624.99  697.99
 '7',       '779.00'   710.00  2679.50
 '7',       '710.00'   710.00  2679.50
 '7',       '2679.50'  710.00  2679.50

any ideas how to do this efficiently ?

user3628777
  • 529
  • 3
  • 10
  • 20

3 Answers3

4

Using base R:

df <- transform(df, Min = ave(Price, SessionID, FUN = min),
                    Max = ave(Price, SessionID, FUN = max))
df
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50

Since your desired result is not aggregated but just the original data with two extra columns, you want to use ave in base R instead of aggregate, which you would typically use if you wanted to aggregate the data by SessionID. (NB: AEBilgrau shows that you could also do it with aggregate with some additional matching.)

Similarly, for dplyr, you want to use mutate instead of summarise because you dont want to aggregate/summarise the data.

Using dplyr:

library(dplyr)
df <- df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
talat
  • 68,970
  • 21
  • 126
  • 157
3

Using data.table package:

library(data.table)

dt = data.table(SessionID=c(1,1,1,7,7,7), Price=c(624,697,649,779,710,2679))

dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
dt
#   SessionId Price Min  Max
#1:         1   624 624  697
#2:         1   697 624  697
#3:         1   649 624  697
#4:         7   779 710 2679
#5:         7   710 710 2679
#6:         7  2679 710 2679

In your case if you have a data.frame df, just do dt=as.data.table(df) and use the code above.

I am curious about the benchmark of the solutions on an average data.frame:

df = data.frame(SessionID=rep(1:1000, each=100), Price=runif(100000, 1, 2000))
dt = as.data.table(df)

algo1 <- function() 
{
    df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
}

algo2 <- function()
{
    dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
}

algo3 <- function()
{
    tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
    cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
}

algo4 <- function()
{
    transform(df, Min = ave(Price, SessionID, FUN = min), Max = ave(Price, SessionID, FUN = max))
}   



#> system.time(algo1())
#   user  system elapsed 
#   0.03    0.00    0.19 

#> system.time(algo2())
#   user  system elapsed 
#   0.01    0.00    0.01 

#> system.time(algo3())
#   user  system elapsed 
#   0.77    0.01    0.78 

#> system.time(algo4())
#   user  system elapsed 
#   0.02    0.01    0.03 
Colonel Beauvel
  • 30,423
  • 11
  • 47
  • 87
  • 2
    Why would you bet `dplyr` solution would be faster? In fact, `dplyr` solution should be even slower because unlike `data.table`, it does *not* update by reference, thus you also need to create a copy and override the original data set using `<-` in order to mimic the exact behaviour of `data.table`s `:=` operator (everyone seem to, somewhat conveniently, forget that...) – David Arenburg Jan 13 '15 at 11:18
  • I used primarily the column with the typo SessionId and renamed it to SessionID ;) ! And my plan was never to say that `dplyr` is faster than `data.table`, I simply would have expected something like `0.1` instead of `0.19`, that's all ...I agree with `microbenchmark`, but I can't install it at my workplace due to some restrictions :) ! I was just curious to compare different possibilities on the timing and having a rough idea! – Colonel Beauvel Jan 13 '15 at 11:35
  • I think we should stop these fights over every single aggregation question. I think everything is summed up pretty much nicely [here](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly/) – David Arenburg Jan 13 '15 at 11:39
  • @DavidArenburg, I agree that we should stop milliseconds fights over every single aggregation question, especially when they are not asking explicitly for the single most efficient way of doing it.. – talat Jan 13 '15 at 11:48
1

Here's my solution using aggregate.

First, load the data:

df <- read.table(text = 
"SessionID   Price
'1'       '624.99'
'1'       '697.99'
'1'       '649.00'
'7'       '779.00'
'7'       '710.00'
'7'       '2679.50'", header = TRUE) 

Then aggregate and match it back to the original data.frame:

tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
df <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
print(df)
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50

EDIT: As per the comment below, you might wonder why this works. It indeed is somewhat weird. But remember that a data.frame just is a fancy list. Try to call str(tmp), and you'll see that the Price column itself is 2 by 2 numeric matrix. It gets confusing as the print.data.frame knows how to handle this and so print(tmp) looks like there are 3 columns. Anyway, tmp[2] simply access the second column/entry of the data.frame/list and returns that 1 column data.frame while tmp[,2] access the second column and return the data type stored.

Anders Ellern Bilgrau
  • 9,928
  • 1
  • 30
  • 37
  • that seems to work - HOWEVER, I'm not sure how - because tmp returns a 2 column dataframe with the columns: session_id , price. In spight of that, the final dataframe contains 2 additional columns : Min and Max. Can you explain ? – user3628777 Jan 13 '15 at 11:30
  • @user3628777 I've modified the post to explain. Hope that helps. – Anders Ellern Bilgrau Jan 13 '15 at 11:35