0

I starting to data-mine a mobile application, and I have a database that looks like this:

Database
UserId         Hour         Date
01                18           01.01.2016
01                18           01.01.2016
01                14           02.01.2016
01                14           03.01.2016
02                21           03.01.2016
02                08           05.01.2016
02                08           05.01.2016
03                23           05.01.2016

I would like to add a new column to this database that sums the number of different days the user has been using the application, In this database for example UserId#01 has been on the platform in three different days,

Expected data outcomes like this:

Database
UserId         Hour         Date                 NumDates
01                18           01.01.2016        3
01                18           01.01.2016        3
01                14           02.01.2016        3
01                14           03.01.2016        3 
02                21           03.01.2016        2
02                08           05.01.2016        2
02                08           05.01.2016        2
03                23           05.01.2016        1

So far I have used this command:

Database["NumDates"] % group_by(UserId) %>% summarise(NumDates = length(unique(Date)))

But it tells me that that it is creating only 5000 lines (the number of different users in my database) when I need +600,000 (the number of sessions in my database)

If somebody could help me with this, it will be greatly appreciated!

Alban Couturier
  • 129
  • 2
  • 8

3 Answers3

1

We can use uniqueN from data.table

library(data.table)
setDT(Database)[, NumDates := uniqueN(Date) , by = UserId]
Database
#   UserId Hour       Date NumDates
#1:      1   18 01.01.2016        3
#2:      1   18 01.01.2016        3
#3:      1   14 02.01.2016        3
#4:      1   14 03.01.2016        3
#5:      2   21 03.01.2016        2
#6:      2    8 05.01.2016        2
#7:      2    8 05.01.2016        2
#8:      3   23 05.01.2016        1
akrun
  • 874,273
  • 37
  • 540
  • 662
0

you could use n_distict in dplyr

library("dplyr")
database<- data.frame(UserId = c(1,1,1,1,2,2,2,3), Hour = c(18,18,14,14,21,8,8,23), Date = c("01.01.2016","01.01.2016","02.01.2016","03.01.2016","03.01.2016","05.01.2016","05.01.2016","05.01.2016"))
database %>% group_by(userId) %>% mutate(NumDates = n_distinct(Date))

the result is as follows

   UserId  Hour       Date NumDates
    (dbl) (dbl)     (fctr)    (int)
1      1    18 01.01.2016        3
2      1    18 01.01.2016        3
3      1    14 02.01.2016        3
4      1    14 03.01.2016        3
5      2    21 03.01.2016        2
6      2     8 05.01.2016        2
7      2     8 05.01.2016        2
8      3    23 05.01.2016        1
ArunK
  • 1,731
  • 16
  • 35
0

You don't want summarise here but mutate. summarise will give you one row by distinct value of the column you grouped by, while mutate will just add another column and preserving existing ones.

Tutuchan
  • 1,527
  • 10
  • 19