0

I have a data.frame containing information from a website log. Among other columns the data.frame contains the cookie ID, the timestamp and the URL. Now I want to compute the time spent on each page, assuming that a page is viewed until another page is loaded by the same cookie ID. This is why the data.frame is ordered by cookie ID and timestamp using:

data = data[with(data, order(cookie_id, timestamp)), ]

Here is how I calculate the time spent on each page using a for loop:

data$calc_time_spent = NA
for (i in 1:(nrow(data)-1)) {
  if (!is.na(data$cookie_id[i]) & !is.na(data$cookie_id[i+1]) & data$cookie_id[i] == data$cookie_id[i+1]) {
    data$calc_time_spent[i] = data$timestamp[i+1]-data$timestamp[i]
  }
}

Unfortunately this is very slow, so I need a more sophisticated solution, maybe using the apply-function?

-

Here are some example data:

cookie_id = c("5", "5", "8", "8", "8")
timestamp = as.POSIXlt(c("2005-4-19 7:01:33", "2005-4-19 7:01:35", "2005-4-19 7:01:10", "2005-4-19 7:01:23", "2005-4-19 7:01:27"))
data = data.frame(timestamp, cookie_id)

Which look like this:

    timestamp   cookie_id
1   2005-04-19 07:01:33 5
2   2005-04-19 07:01:35 5
3   2005-04-19 07:01:10 8
4   2005-04-19 07:01:23 8
5   2005-04-19 07:01:27 8

After the operation the data should have a third column:

    timestamp   cookie_id   calc_time_spent
1   2005-04-19 07:01:33 5   2
2   2005-04-19 07:01:35 5   NA
3   2005-04-19 07:01:10 8   13
4   2005-04-19 07:01:23 8   4
5   2005-04-19 07:01:27 8   NA
elgehelge
  • 2,014
  • 1
  • 19
  • 24
  • 2
    Something like `ave(data$timestamp, data$cookie_id, FUN=diff)`? Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – sgibb Aug 20 '13 at 18:34
  • I have now provided a reproducible example.. – elgehelge Aug 20 '13 at 21:26

2 Answers2

0

It looks like you just want to take the difference between vectors. Consider the following (although I could be more helpful if you provide reproducible data)

data$calc_time_spent[2:length(data$calc_time_spent)] - 
  data$calc_time_spent[1:(length(data$calc_time_spent) - 1)]

If you want to do this within each Cookie ID, consider using the "plyr" package or possibly data.table. It should be straightforward to include these commands inside a "ddply" wrapper or a "data.table" command, using

### Data Table Approach (faster)
library(data.table)
data[, time.spent := calc_time_spent[2:.N] - calc_time_spent[1:(.N - 1)], 
     by = cookie_id]

### Plyr approach (slower)
library(plyr)
ddply(data, .(cookie_id), summarize,
      time.spent = calc_time_spent[2:length(calc_time_spent)] - 
                   calc_time_spent[1:(length(calc_time_spent) - 1)])

Note the .N is a built in constant within Data.Table which is a length 1 vector reporting the number of observations within each "cookie_id" group.

Of course, you could also use a built in function, such as "diff", if you wanted.

Andreas
  • 1,923
  • 19
  • 24
  • I get the following error when trying the table solution: Error in `[.data.frame`(data, , `:=`(time.spent, calc_time_spent[2:.N] - : unused argument (by = "cookie_id") Btw. I have provided an example now. – elgehelge Aug 20 '13 at 21:26
  • 1
    I think it would be safer to use diff in case `.N =1` – mnel Aug 20 '13 at 23:24
0

Here is a solution. It a hack. I'm still interested in knowing how to apply a function the works on two rows at a time, and also have a conditional statement saying that "if SOMETHING, then use NA".

data$timestamp_next = c(data$timestamp[2:length(data$timestamp)], data$timestamp[1])
data$time_on_page = data$timestamp_next - data$timestamp
data$cookie_id_int = as.integer(data$cookie_id)
data$cookie_id_int_next = c(data$cookie_id_int[2:length(data$cookie_id_int)], data$cookie_id_int[1])
data$cookie_id_same_as_next = data$cookie_id_int == data$cookie_id_int_next
data$time_on_page[data$cookie_id_same_as_next == FALSE | is.na(data$cookie_id_same_as_next)] = NA
data$cookie_id_int = NULL
data$cookie_id_int_next = NULL
data$cookie_id_same_as_next = NULL
elgehelge
  • 2,014
  • 1
  • 19
  • 24