0

How to generate cumulative distinct count with data condition in R

I have a problem in R. The dataset contains 9 categorical variables so I have to manipulate numeric variables to convert the categorical variable to numbers. The new variables I want to build look like this, for instance, the distinct number homephone of the same SSN in that record in last 7 days. Parts of the code I have written is as follows. It took me 2 hours to do the calculation for only 30% records in the 100k dataset. How could I optimize this code to make it more efficient? Thank you.

Original variables: record date ssn firstname lastname address zip dob homephone

Data sample: Here is a link to an image of my one of my data samples

### Data sample[enter image description here][1]
|record | date      | ssn       |homephone  ||date1   |dscnt|
|52     | 20150101  |737610282  |9105580920 |20150101 | 1   |
|91     | 20150101  |737610282  |3301371772 |20150101 | 2   |
|144    | 20150101  |737610282  |4139873682 |20150101 | 3   |
|236    | 20150101  |737610282  |4345558469 |20150101 | 4   |
|272    | 20150101  |737610282  |3979419673 |20150101 | 5   |
|282    | 20150102  |737610282  |8261221737 |20150102 | 6   |
|330    | 20150102  |737610282  |9105580920 |20150102 | 6   |
|412    | 20150102  |737610282  |7115050432 |20150102 | 7   |
|493    | 20150102  |737610282  |6872924382 |20150102 | 8   |
|571    | 20150103  |737610282  |2528003837 |20150103 | 9   |
|621    | 20150103  |737610282  |9105580920 |20150103 | 9   |
|763    | 20150103  |737610282  |581105006  |20150103 | 10  |
|854    | 20150104  |737610282  |9105580920 |20150104 | 10  |
|883    | 20150104  |737610282  |979778244  |20150104 | 11  |
|966    | 20150104  |737610282  |9952278232 |20150104 | 12  |
|974    | 20150104  |737610282  |3664924933 |20150104 | 13  |
|1057   | 20150104  |737610282  |9105580920 |20150104 | 13  |
|1091   | 20150104  |737610282  |32384275   |20150104 | 14  |
|1166   | 20150105  |737610282  |942052630  |20150105 | 15  |
|1304   | 20150105  |737610282  |9105580920 |20150105 | 15  |
|1416   | 20150105  |737610282  |1384461737 |20150105 | 16  |

These are the codes:

Name of the same SSN, Address, Phone for the last 7 days
    i= 1
    e1 = NULL

    for (i in 1:365)
    {
      i = 360
      a1 <- df0 %>% filter(date <= DT[365-i,])
      b1 <- 
        a1 %>%
        select(record,date,ssn,homephone) %>%
        mutate(date1 = date + days(as.integer(today() -max(a$date)) )) %>%
        filter(date1 >= today() - days(7)) %>%
        group_by(ssn) %>%
        mutate(dscnt = n_distinct(homephone)) 

      f1 = b1 %>% filter(date == DT[365-i,])

      e1 = rbind(e1,f1)
    }

0 Answers0