
I have a data frame "m" as shown below:

m
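The data frame itself was posted as an image; a small mock-up with the same columns (AccountID, V1, Month, inferred from the answers below) might look like this, with values made up for illustration:

```r
# Hypothetical reconstruction of m (the original was posted as an image).
# One row per account/month observation, with a count V1:
m <- data.frame(AccountID = c(2, 2, 2, 3, 3, 3),
                V1        = c(10, 25, 40, 50, 20, 10),
                Month     = c(1, 3, 6, 1, 2, 6))
```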

I am trying to find each account's most active month (the month with the highest total V1). For example, for account 2 it would be month 6; for account 3, month 1; and so on.

I wrote the loop below and it works fine, but it takes a long time even though I only used 8,000 rows; the whole data set has 250,000 rows, so the code below is not usable. Can anyone suggest a better way to achieve this?

Thanks a lot.

code

cfceric
  • In the future, you can type your code directly into your question and use the {} icon at the top to format it as code. This will give you syntax highlighting based on the type of code it is, and will be much easier for the rest of us to copy what you've done and get it into our code editors. – Chase Apr 15 '12 at 15:00

5 Answers


You can do that using plyr

library(plyr)
ddply(m, "AccountID", subset, V1==max(V1))

EDITED: To get the result by month, just change the "id" variable:

library(plyr)
ddply(m, "Month", subset, V1==max(V1))
Manuel Ramón
  • -1: that is not what he is asking. It would just return a line for each month/account couple with the global maximum of V1. You need to aggregate the data by month and account instead. – nico Apr 15 '12 at 10:13
  • @nico, I do not understand your negative vote. I do not know if I misunderstood the question, but the solution proposed is equally valid; simply change the ID variable in the ddply function – Manuel Ramón Apr 15 '12 at 11:15
  • Sorry my bad, I seem to have misunderstood the dataset the OP posted. – nico Apr 15 '12 at 11:57
  • @ManuelRamón +1 Nice solution as well. – Christoph_J Apr 15 '12 at 11:57

I think Owe Jessen's comment is right and this is not the answer to the problem. So here is my shot with the help of data.table.

First, let's use an example that is a little bit easier to comprehend:

library(data.table)
DT <- data.table(AccountID = rep(1:3, each=4),
                 V1        = sample(1:100, 12, replace=FALSE),
                 Month     = rep(1:4, times=3))
      AccountID V1 Month
 [1,]         1 81     1
 [2,]         1 23     2
 [3,]         1 72     3
 [4,]         1 36     4
 [5,]         2 22     1
 [6,]         2 13     2
 [7,]         2 50     3
 [8,]         2 40     4
 [9,]         3 74     1
[10,]         3 83     2
[11,]         3  4     3
[12,]         3  3     4

So here we have three accounts and four months, and for every account/month combination we have a V1 value. To find the maximum V1 for each account, I do the following:

setkey(DT, AccountID)
DT <- DT[, list(maxV1=max(V1)), by="AccountID"][DT]
DT[maxV1==V1]
     AccountID maxV1 V1 Month
[1,]         1    81 81     1
[2,]         2    50 50     3
[3,]         3    83 83     2

This is a little hard to understand, so let me try to explain it a little: I set AccountID as the key for DT. Now, I basically do two steps in DT[, list(maxV1=max(V1)), by="AccountID"][DT]. First, I compute the maximum V1 value for each account (DT[, list(maxV1=max(V1)), by="AccountID"]) and then, by calling [DT] right after it, I add this new column maxV1 to the old DT. Obviously, then I only have to get all the rows for which maxV1==V1 holds.
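As a side note (my addition, not part of the original answer), the same per-account result can be obtained in one step by indexing within each group; a sketch using the example DT from above:

```r
library(data.table)  # assumes data.table is installed
DT <- data.table(AccountID = rep(1:3, each = 4),
                 V1        = c(81, 23, 72, 36, 22, 13, 50, 40, 74, 83, 4, 3),
                 Month     = rep(1:4, times = 3))
# For each account, take the maximum V1 and the month at which it occurs
DT[, list(maxV1 = max(V1), Month = Month[which.max(V1)]), by = "AccountID"]
#   AccountID maxV1 Month
#           1    81     1
#           2    50     3
#           3    83     2
```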

Applying this solution to Nico's more advanced example and showing you how to convert a data.frame to a data.table:

library(data.table)
DT <- as.data.table(m)
# Note that this line is only necessary if there is more than one row per Month/AccountID combination
DT <- DT[, sum(V1), by="Month,AccountID"]
setkey(DT, AccountID)
DT <- DT[, list(maxV1=max(V1)), by="AccountID"][DT]
DT[maxV1==V1]
   AccountID maxV1 Month    V1
           1 24660     1 24660
           2 22643     2 22643
           3 23642     3 23642
           4 22766     5 22766
           5 22445    12 22445
...

And this gives exactly 50 rows.

EDIT:

And here is a base-R solution:

df <- data.frame(AccountID = rep(1:3, each=4),
                 V1        = sample(1:100, 12, replace=FALSE),
                 Month     = rep(1:4, times=3))
df$maxV1 <- ave(df$V1, df$AccountID, FUN = max)
df[df$maxV1==df$V1, ]

I took my inspiration from here.

Christoph_J

I do not see a way to vectorize this algorithm (if someone else does, I'd be curious to see how).

Here is how I would code it (P.S.: please include self-contained code in the future; see ?dput for help):

make.data <- function(n = 100) # 250000
{
  # Generate some random data
  AccountID <- sample(1:50, n, replace = TRUE)
  V1        <- sample(1:100, n, replace = TRUE)
  Month     <- sample(1:12, n, replace = TRUE)

  m <- data.frame(AccountID, V1, Month)
  m
}



fo <- function(X)
{
  unique_ID <- unique(X$AccountID)
  M_max <- numeric(length(unique_ID))

  for (i in seq_along(unique_ID))
  {
    ss <- X$AccountID == unique_ID[i]
    M_max[i] <- X[ss, "Month"][which.max(X[ss, "V1"])]
  }

  # results:
  data.frame(unique_ID, M_max)
}


X <- make.data(1000000)
system.time(fo(X))
#   user  system elapsed 
#   2.32    0.33    2.70 

I suspect some of these functions might be faster than the ones you have used (but it is worth testing the times).

EDIT: I thought R's new JIT compiler might help you (you can read more about it here: Speed up your R code using a just-in-time (JIT) compiler), but I tried it with JIT too, and it didn't speed things up.

It might also be worth parallelizing your loop (but I won't go into that now).
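For what it's worth, here is a minimal sketch of what such a parallel version could look like, using the base parallel package (my addition; the data, core count, and structure are assumptions, not part of Tal's answer):

```r
# Sketch of a parallelised per-account loop (assumed, not from the answer).
library(parallel)

# Some random data with the same shape as make.data() produces
X <- data.frame(AccountID = sample(1:50, 10000, replace = TRUE),
                V1        = sample(1:100, 10000, replace = TRUE),
                Month     = sample(1:12, 10000, replace = TRUE))

ids <- unique(X$AccountID)
# mclapply forks on Unix; on Windows use mc.cores = 1 or parLapply with a cluster
res <- mclapply(ids, function(id) {
  ss <- X$AccountID == id
  X$Month[ss][which.max(X$V1[ss])]
}, mc.cores = 2)
data.frame(unique_ID = ids, M_max = unlist(res))
```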

If the timing is not realistic, there might be a way to do it using the data.table package (but I do not have experience working with it), or even by doing it in SQL...

Good luck, Tal

UPDATE: I used nico's example and wrapped the solution in functions. The timing is absolutely fine; no need for more advanced solutions...

Tal Galili

This is pretty much instantaneous on my laptop using 250000 rows (plus it is much cleaner)

# Generate some random data
AccountID <- sample(1:50, 250000, replace=TRUE)
V1 <- sample(1:100, 250000, replace=TRUE)
Month <- sample(1:12, 250000, replace=TRUE)

m <- data.frame(AccountID, V1, Month)

# Aggregate the data by month
V1.per.month <- aggregate(m$V1, by=list(Month = m$Month), FUN=sum)

EDIT: re-reading the question I realized I forgot to take into account the account (pun intended)

This should do, however

V1.per.month <- aggregate(m$V1, FUN=sum,
             by=list(Month = m$Month, Account = m$AccountID))
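To then extract each account's most active month from that aggregate, one extra step is still needed. A sketch (my addition, not part of nico's answer) that picks, per account, the row with the largest aggregated V1 (which aggregate() stores in a column named x):

```r
# Random data as in nico's answer
m <- data.frame(AccountID = sample(1:50, 250000, replace = TRUE),
                V1        = sample(1:100, 250000, replace = TRUE),
                Month     = sample(1:12, 250000, replace = TRUE))

# Sum V1 per month/account; aggregate() names the summed column "x"
V1.per.month <- aggregate(m$V1, FUN = sum,
                          by = list(Month = m$Month, Account = m$AccountID))

# For each account, keep the row with the maximum aggregated V1
best <- do.call(rbind, lapply(split(V1.per.month, V1.per.month$Account),
                              function(d) d[which.max(d$x), ]))
head(best)
```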

A timing graph (error bars are SD). As you can see it takes ~2.5s per 1 million rows, which is very acceptable, I think.

(Figure: elapsed time per number of rows)

nico
  • I think this is not the answer the OP is looking for: You are aggregating the V1 per month (resulting in a df with 12 rows). The OP is looking for the most active month per account (which would be a df with 50 rows) – Owe Jessen Apr 15 '12 at 10:50
  • @Owe Jessen: I am also aggregating per account (see the edit) – nico Apr 15 '12 at 11:01
  • @nico I agree with Owe Jessen. If I get your solution right, you just get the sum of V1s for every month and account. However, I think the dataset of the OP only has one V1 value for each month/account combination. But now, he wants to find the month for which each account has the maximum V1. I posted an answer doing that with data.table (hopefully correct ;-). So if you agree, maybe you can find a solution with base R. I can't, because base R isn't my strong suit (I'm used to data.table now ;-) – Christoph_J Apr 15 '12 at 11:08
  • Hello dear Nico, I suspect that (as others have mentioned) your solution did not give the relevant answer to the OP's question. But thanks to your code, I updated my answer below to give timings for the solution - and it is quite good in base R... – Tal Galili Apr 15 '12 at 11:16
  • Having a look [here](http://stackoverflow.com/questions/8225621/faster-way-to-create-variable-that-aggregates-a-column-by-id/8227271#8227271) (that's where I took the inspiration for my answer), I guess you only need to apply `ave(df$V1, df$AccountID, FUN = max)` to your data.frame m and then get all the rows where V1 equals the maximum V1 per account. – Christoph_J Apr 15 '12 at 11:17
  • @Christoph_J: hmmmm you're probably right, I was assuming multiple account/month lines and not an already aggregated table. – nico Apr 15 '12 at 11:59
  • @Tal Galili: I'm worried now that the way I generate the data is not really correct... see Christoph's comment. It would be nice if the OP provided a clarification on that. – nico Apr 15 '12 at 12:00

I suppose this is basically the same solution as Tal's.

I get a reasonable time with the following loop:

# Generate some random data
AccountID <- sample(1:50, 250000, replace=TRUE)
V1 <- sample(1:100, 250000, replace=TRUE)
Month <- sample(1:12, 250000, replace=TRUE)

m <- data.frame(AccountID, V1, Month)

# Aggregate the data by month

ac = as.numeric(levels(as.factor(m$AccountID)))
active.month = rep(NA, length(ac))
names(active.month) = ac

system.time(
{
  for(i in ac)
  {
    subm = subset(m, AccountID == i)
    active.month[i] = subm[which.max(subm[,"V1"]),"Month"]
  }
})
   user  system elapsed 
   0.78    0.14    0.92 
Owe Jessen