r - Adding a row index based on a combination of multiple columns in a large dataframe

Question

I have a relatively large data frame spanning 5.5M observations containing 2 variables, here is a sample of the data. It contains information about days when a user generated a pageview.

 timestamp     user_id
2013-11-07  ff268cef0c29 
2013-11-02  12bb7af7a842 
2013-11-30  e45abb10ae0b 
2013-11-06  e45abb10ae0b
2013-11-25  f266f8c9580e

Note, that the data frame is ordered by first the user_id and then the timestamp, through

kstaord<-kstaord[order(kstaord$timestamp,kstaord$user_id),]

My goal is to create an index column, that would index all page views of a user based on the date of their occurrence, which should result in something like this:

 timestamp     user_id     index
2013-11-07  ff268cef0c29     1 
2013-11-02  12bb7af7a842     1 
2013-11-30  e45abb10ae0b     1 
2013-11-06  e45abb10ae0b     2
2013-11-25  f266f8c9580e     1

So far I have tried the for loop:

for (j in 2:nrow(kstaord)) {
if (kstaord$user_id[j]!=kstaord$user_id[j-1]) {kstaord$index[j]<-1}
else {kstaord$index[j]<-kstaord$index[j-1]+1}
if ((j %% 100000)==0) {print(".")}
}

Which takes a lot of time, yet creates inconsistent results. I have found a stack overflow thread here which deals with a similar issues. I have tried the following suggestion this thread, which I am unsure of (my search for the ave function says it produces average values):

index <- ave( 1:nrow(kstaord), kstaord$user_id, factor( kstaord$timestamp),          
FUN=function(x) 1:length(x) )

I was however unable to let this function run its full course (I am working on a server installation of R and currently I have a bad connection, so I can not be sure this doesn't happen due to my connection problems). The original function also had the secondary dimension defined as a factor, while I am using a date variable so this might be an issue as well.

I am looking for a solution that would give consistent results and would run without problems on a server where I can use almost 24 GB RAM at most.

Thanks in advance for any help.

score 2 · Accepted Answer · answered Jan 04 '14 at 10:03

2

Use a data.table:

library(data.table)
DT <- as.data.table(dat)
DT[, index := seq_len(.N), by = user_id]
   timestamp      user_id index
1: 2013-11-07 ff268cef0c29     1
2: 2013-11-02 12bb7af7a842     1
3: 2013-11-30 e45abb10ae0b     1
4: 2013-11-06 e45abb10ae0b     2
5: 2013-11-25 f266f8c9580e     1

answered Jan 04 '14 at 10:03

agstudy

119,832
17
199
261

Thanks, worked fine. While I have started using R only recently, I am always happy to find that for every problematic solution, there is a simple and efficient way to work with the data that is available either by default or through a package. – Tajpi Jan 04 '14 at 12:31

Henrik · Answer 2 · 2014-01-04T10:44:48.853

An ave alternative. If I understood your question correctly, you don't need 'timestamp' as a grouping variable. It should be enough with 'user_id'.

df$index <- with(df, ave(user_id, user_id, FUN = seq_along))

df
#    timestamp      user_id index
# 1 2013-11-07 ff268cef0c29     1
# 2 2013-11-02 12bb7af7a842     1
# 3 2013-11-30 e45abb10ae0b     1
# 4 2013-11-06 e45abb10ae0b     2
# 5 2013-11-25 f266f8c9580e     1

And yes, you are correct that the default function in ave "produces average values" (FUN = mean). But you may set that FUNction to whatever function that suits your needs.

On larger data sets, data.table alternatives are most likely faster (such as the answer provided by @agstudy).

r - Adding a row index based on a combination of multiple columns in a large dataframe

2 Answers2