
I have another question in the same project scope ("pandas dataframe groupby datetime month"); however, I fear the data structure there might be too complicated, so I am trying an alternative approach that I hope achieves the same result.

I am ideally looking to build a matrix of phone numbers as rows and start and end dates as columns and identify the period in which a telephone call was made.

This will be achieved by transforming a dataset of dates and phone numbers to a complete list of dates, identifying an end day match, and then seeing if the date the telephone call was made falls within that period.

The original data looks like:

    Date <- as.Date(c("2019-03-01", "2019-03-15", "2019-03-29",
                      "2019-04-10", "2019-03-05", "2019-03-20"))
    Phone <- c("070000001", "070000001", "070000001",
               "070000001", "070000002", "070000002")
    df <- data.frame(Date, Phone)

    df

    ##         Date     Phone
    ## 1 2019-03-01 070000001
    ## 2 2019-03-15 070000001
    ## 3 2019-03-29 070000001
    ## 4 2019-04-10 070000001
    ## 5 2019-03-05 070000002
    ## 6 2019-03-20 070000002

Ideally I would want it to look like this:

    ##         Date     Phone INT_1 INT_2 INT_3 INT_4 INT_5
    ## 1 2019-03-01 070000001     X     X     X     X     X
    ## 2 2019-03-15 070000002                 X     X     X

Here each INT is a rolling period running from a start date to that date + 30 days, and X indicates that the telephone number appeared in that period.

To do this I assume you need two datasets: the one above, of telephone numbers by date called, and a second which is the complete list of days and their +30 day counterparts.

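    # One row per day; each row is the start of a 30-day period.
    # (as.data.frame() on a vector ignores responseName, so name the column directly.)
    start <- seq(as.Date("2016-07-01"), as.Date("2019-07-01"), by = "day")
    dates <- data.frame(start = start)

    dates$end <- dates$start + 30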

    ## INT       start        end
    ##   1 2016-07-01 2016-07-31
    ##   2 2016-07-02 2016-08-01
    ##   3 2016-07-03 2016-08-02
    ##   4 2016-07-04 2016-08-03

But how do I get the two to evaluate together? I am assuming some kind of merge that expands the telephone data into the date list, and then a spread of the dates by the row index (INT)?

1 Answer


I think that to match the two dataframes you could use a fuzzy join from the fuzzyjoin package. For example, if I define a dataframe of phone numbers and usage dates as:

    library(dplyr)
    library(fuzzyjoin)
    fake_phone_data <- tibble(
      date = as.Date(c("2019-01-03", "2019-01-27", "2019-02-12", "2019-02-25", "2019-02-26")), 
      phone = c("1", "1", "2", "2", "2")
    )

and a dataframe of starting/ending dates (plus an ID column) as:

    id_dates <- tibble(
      ID = c("1", "2", "3", "4"), 
      starting_date = as.Date(c("2019-01-01", "2019-01-16", "2019-02-01", "2019-02-16")), 
      ending_date = as.Date(c("2019-01-15", "2019-01-31", "2019-02-15", "2019-02-27"))
    )

then I can join the two dataframes using a fuzzyjoin, i.e. two rows are matched if the date of the phone call happens between the starting date and the end date of the corresponding period:

    fuzzy_left_join(
      fake_phone_data, 
      id_dates, 
      by = c(
        "date" = "starting_date", 
        "date" = "ending_date"
      ),
      # Match when starting_date <= date <= ending_date; `<=` keeps calls made
      # on the ending date itself, which `<` would leave unmatched.
      match_fun = list(`>=`, `<=`)
    )
    #> # A tibble: 5 x 5
    #>   date       phone ID    starting_date ending_date
    #>   <date>     <chr> <chr> <date>        <date>     
    #> 1 2019-01-03 1     1     2019-01-01    2019-01-15 
    #> 2 2019-01-27 1     2     2019-01-16    2019-01-31 
    #> 3 2019-02-12 2     3     2019-02-01    2019-02-15 
    #> 4 2019-02-25 2     4     2019-02-16    2019-02-27 
    #> 5 2019-02-26 2     4     2019-02-16    2019-02-27

Created on 2019-07-19 by the reprex package (v0.3.0)
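
The question asks for a matrix of phone numbers by period rather than a long table. One possible way to get there is to cross-tabulate `phone` against `ID` and mark every non-zero count with an X. This is only a minimal base R sketch: the `joined` data frame is retyped from the `phone` and `ID` columns of the join result so that the block runs on its own, and the names `joined`, `counts` and `wide` are introduced here for illustration.

```r
# Long result of the interval join, reduced to the two columns needed here
# (`joined` is a name introduced for this sketch).
joined <- data.frame(
  phone = c("1", "1", "2", "2", "2"),
  ID    = c("1", "2", "3", "4", "4"),
  stringsAsFactors = FALSE
)

# Use a factor with all period IDs so that a period without any call
# would still get its own (empty) column.
joined$ID <- factor(joined$ID, levels = c("1", "2", "3", "4"))

# Count calls per phone and period, dropping the "table" class to get a plain matrix.
counts <- unclass(table(joined$phone, joined$ID))

# Replace counts by "X" (at least one call) or "" (no call).
wide <- counts
wide[] <- ifelse(counts > 0, "X", "")
colnames(wide) <- paste0("INT_", colnames(wide))

print(wide, quote = FALSE)
```

Rows are phone numbers and columns are periods, so phone 1 is marked in INT_1 and INT_2 and phone 2 in INT_3 and INT_4. With many periods the matrix gets wide quickly, so keeping the long format for analysis and only widening for display may be preferable.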

Does it solve your problem?

This approach is very similar to this question.
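
For comparison, the same interval condition can be sketched in base R on the data from the question, without extra packages: form every call/period pair with a cross join (`merge(..., by = NULL)`) and keep the pairs where the call date falls inside the period. The period list is shortened to start dates in March 2019 to keep the example small; that range and the names `windows`, `pairs` and `hits` are assumptions made for this sketch, not part of the original post.

```r
# Call log from the question.
df <- data.frame(
  Date  = as.Date(c("2019-03-01", "2019-03-15", "2019-03-29",
                    "2019-04-10", "2019-03-05", "2019-03-20")),
  Phone = c("070000001", "070000001", "070000001",
            "070000001", "070000002", "070000002"),
  stringsAsFactors = FALSE
)

# Rolling periods: one starting on each day of March 2019 (assumed range),
# each ending 30 days after its start, as in the question.
start <- seq(as.Date("2019-03-01"), as.Date("2019-03-31"), by = "day")
windows <- data.frame(INT = seq_along(start), start = start, end = start + 30)

# Cross join: every call paired with every period.
pairs <- merge(df, windows, by = NULL)

# Keep the pairs where the call date lies within the period (both ends inclusive).
hits <- pairs[pairs$Date >= pairs$start & pairs$Date <= pairs$end, ]

# One row per phone and period, regardless of how many calls fell into it.
hits <- unique(hits[, c("Phone", "INT")])
hits <- hits[order(hits$Phone, hits$INT), ]
head(hits)
```

The cross join grows as calls times periods, which is fine for a small example but is one reason to prefer a dedicated interval join on the full three-year list of start dates.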

agila

That looks like a valid solution, @agila. It keeps the data in long format, which is even easier to transform. Thank you so much! That is a great package; I will need to use it in the future. – Jacob Baisley Jul 22 '19 at 08:01