I am currently working on an application where I have a dataframe that looks like this:

Database
UserId         Hour         Date
01                18           01.01.2016
01                18           01.01.2016
01                14           02.01.2016
01                14           02.01.2016
02                21           02.01.2016
02                08           05.01.2016
02                08           05.01.2016
03                23           05.01.2016

Each line represents a session.
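
To make the answers below reproducible, the table above can be rebuilt as a data frame. This is only a sketch: the column types (integer ids and hours, dates kept as character strings) are assumptions, since the post does not show how the data is stored.

```r
# Rebuild the example table shown above.
# Assumption: UserId and Hour are integers, Date is a character string.
Database <- data.frame(
  UserId = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L),
  Hour   = c(18L, 18L, 14L, 14L, 21L, 8L, 8L, 23L),
  Date   = c("01.01.2016", "01.01.2016", "02.01.2016", "02.01.2016",
             "02.01.2016", "05.01.2016", "05.01.2016", "05.01.2016"),
  stringsAsFactors = FALSE
)

# Inspect the structure: 8 rows (one per session), 3 columns.
str(Database)
```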

I need to determine whether the time of the first session of a user has an impact on the number of sessions this user is going to have.

I have tried the command summaryBy:

library(doBy)
first_hour <- summaryBy(UserId + Hour + Date ~ UserId, 
    FUN=c(head, length, unique), database)

But it doesn't give me the correct result.

My goal is to determine, for each user, the Hour of the first session, the total number of sessions, and the number of distinct session dates.

Daniel Widdis
Alban Couturier
  • Please show the expected output. Perhaps `library(data.table); setDT(df1)[, .N ,names(df1)]` – akrun May 30 '16 at 10:13

3 Answers

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), order the rows by 'UserId' and 'Date', and then, grouped by 'UserId', take the first 'Hour', the total number of sessions (.N), and the number of unique 'Date' values (uniqueN(Date)).

library(data.table)
setDT(df1)[order(UserId, as.Date(Date, "%m.%d.%Y")),.(Hour = Hour[1L],
      Sessions = .N, DifferSessionDate = uniqueN(Date)) , by = UserId]
#    UserId Hour Sessions DifferSessionDate
#1:      1   18        4                 2
#2:      2   21        3                 2
#3:      3   23        1                 1
akrun
  • Thank you, although it should give me the hour of the first session; in this case, for example, UserId #1 has its first session at 18 and not at 14 – Alban Couturier May 30 '16 at 10:21

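You could also do this using dplyr. Note that first(Hour) takes the hour from the first row of each user, so it assumes the rows are already in chronological order; min(Hour) would instead return the earliest hour of the day across all of the user's sessions (14 and 8 for users 1 and 2), which is not the hour of the first session.

library(dplyr)
dt %>% group_by(UserId) %>% summarise(FirstHour = first(Hour),
                                      NumSessions = n(),
                                      NumDates = length(unique(Date)))

Source: local data frame [3 x 4]

  UserId FirstHour NumSessions NumDates
   (int)     (int)       (int)    (int)
1      1        18           4        2
2      2        21           3        2
3      3        23           1        1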
David_B

Using base R, you can write your own function to extract the desired information:

user.info <- function(user){
    temp <- subset(Database, Database$UserId == user)
    return(c(UserId=user, FirstHour=temp$Hour[1], Sessions=nrow(temp), Dates=length(unique(temp$Date))))
}

t(sapply(unique(Database$UserId), FUN=user.info)) 
#     UserId FirstHour Sessions Dates
# [1,]      1        18        4     2
# [2,]      2        21        3     2
# [3,]      3        23        1     1

Here, FirstHour is the hour on the first listed row for the given user (so the rows are assumed to be in chronological order), Sessions is the number of rows for the user, and Dates is the number of different dates listed for the user.

The function is applied to all unique users and the final table is transposed.
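
If the rows are not guaranteed to be in chronological order, the same idea can be sketched in base R with an explicit sort before taking the first row per user. This is a minimal sketch, not part of the original answer: the day.month.year date format and the helper column ParsedDate are assumptions, and the example data is rebuilt inline so the snippet runs on its own.

```r
# Example data from the question (column types are an assumption).
Database <- data.frame(
  UserId = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L),
  Hour   = c(18L, 18L, 14L, 14L, 21L, 8L, 8L, 23L),
  Date   = c("01.01.2016", "01.01.2016", "02.01.2016", "02.01.2016",
             "02.01.2016", "05.01.2016", "05.01.2016", "05.01.2016"),
  stringsAsFactors = FALSE
)

# Assumption: dates are written as day.month.year.
Database$ParsedDate <- as.Date(Database$Date, format = "%d.%m.%Y")

# Sort by user, then by date; rows that tie keep their original order.
sorted <- Database[order(Database$UserId, Database$ParsedDate), ]

# Summarise each user's block of rows into a one-row data frame.
per_user <- lapply(split(sorted, sorted$UserId), function(rows) {
  data.frame(
    UserId    = rows$UserId[1],
    FirstHour = rows$Hour[1],               # hour of the earliest session
    Sessions  = nrow(rows),                 # total number of sessions
    Dates     = length(unique(rows$Date))   # number of distinct dates
  )
})

# Stack the per-user rows into a single table.
result <- do.call(rbind, per_user)
rownames(result) <- NULL
print(result)
```

Unlike the matrix returned by sapply, the result here is a data frame, so each column keeps its own type.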

nya