I have a relatively large data frame spanning 5.5M observations containing 2 variables, here is a sample of the data. It contains information about days when a user generated a pageview.
timestamp user_id
2013-11-07 ff268cef0c29
2013-11-02 12bb7af7a842
2013-11-30 e45abb10ae0b
2013-11-06 e45abb10ae0b
2013-11-25 f266f8c9580e
Note, that the data frame is ordered by first the user_id and then the timestamp, through
kstaord<-kstaord[order(kstaord$timestamp,kstaord$user_id),]
My goal is to create an index column, that would index all page views of a user based on the date of their occurrence, which should result in something like this:
timestamp user_id index
2013-11-07 ff268cef0c29 1
2013-11-02 12bb7af7a842 1
2013-11-30 e45abb10ae0b 1
2013-11-06 e45abb10ae0b 2
2013-11-25 f266f8c9580e 1
So far I have tried the for loop:
for (j in 2:nrow(kstaord)) {
if (kstaord$user_id[j]!=kstaord$user_id[j-1]) {kstaord$index[j]<-1}
else {kstaord$index[j]<-kstaord$index[j-1]+1}
if ((j %% 100000)==0) {print(".")}
}
Which takes a lot of time, yet creates inconsistent results. I have found a stack overflow thread here which deals with a similar issues. I have tried the following suggestion this thread, which I am unsure of (my search for the ave function says it produces average values):
index <- ave( 1:nrow(kstaord), kstaord$user_id, factor( kstaord$timestamp),
FUN=function(x) 1:length(x) )
I was however unable to let this function run its full course (I am working on a server installation of R and currently I have a bad connection, so I can not be sure this doesn't happen due to my connection problems). The original function also had the secondary dimension defined as a factor, while I am using a date variable so this might be an issue as well.
I am looking for a solution that would give consistent results and would run without problems on a server where I can use almost 24 GB RAM at most.
Thanks in advance for any help.