I have researched "batch variables" but I'm still not fully comfortable with their use yet.

I have a data frame with a column filled with different phone numbers. For example:

111-111-1111
111-111-1111
222-222-2222
222-222-2222
222-222-2222
222-222-2222
333-333-3333
333-333-3333
333-333-3333

And another column that shows the date that the calls were made, respectively. For example:

09/01/15
09/02/15
09/03/15
09/04/15
09/05/15
09/06/15
09/07/15
09/08/15
09/09/15

I would like to see how many days pass between calls for each phone number. This example is very simple, but my real data set has 27,000 entries. I need help with creating batch variables and loops (if necessary).

I am using the "lubridate" package to parse the dates and the "plyr" package for its count function, which I want to use to see how many times these calls repeat.

Goal: Find the average time (days) between Call 1 and Call 2, between Call 2 and Call 3, between Call i and Call i+1.

I am a very new R user. I have searched extensively for a solution to this type of problem. Thank you to anyone willing to help.

el_dewey

1 Answer

With the dplyr library, you can do something like this:

library(dplyr)
df %>% group_by(phone) %>% mutate(daysBetweenCalls = as.numeric(difftime(date, lag(date), units = 'days')))

Ensure that the date field is in Date format before running this. Because the dates use two-digit years, the format needs a lowercase %y (an uppercase %Y would read "15" as the year 0015):

df$date <- as.Date(df$date, format = '%m/%d/%y')

Output will be as follows:

Source: local data frame [9 x 3]
Groups: phone [3]

         phone       date daysBetweenCalls
         (chr)     (date)            (dbl)
1 111-111-1111 2015-09-01               NA
2 111-111-1111 2015-09-02                1
3 222-222-2222 2015-09-03               NA
4 222-222-2222 2015-09-04                1
5 222-222-2222 2015-09-05                1
6 222-222-2222 2015-09-06                1
7 333-333-3333 2015-09-07               NA
8 333-333-3333 2015-09-08                1
9 333-333-3333 2015-09-09                1

The first row for each phone number is NA because there is no earlier call to compare against.
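To get from the per-row gaps to the averages stated in the question's goal (Call 1 to Call 2, Call 2 to Call 3, and so on), one possible approach is to number each phone's calls and then average the gaps by that number. This is only a sketch: the sample data frame is rebuilt from the question's example, and the names `callNumber`, `gaps`, `avgByStep`, `avgDays` and `nPhones` are made up for illustration.

```r
library(dplyr)

# Hypothetical sample data mirroring the example in the question
df <- data.frame(
  phone = c(rep("111-111-1111", 2), rep("222-222-2222", 4), rep("333-333-3333", 3)),
  date  = as.Date(sprintf("09/%02d/15", 1:9), format = "%m/%d/%y"),
  stringsAsFactors = FALSE
)

# Number the calls within each phone and compute the gap to the previous call
gaps <- df %>%
  arrange(phone, date) %>%
  group_by(phone) %>%
  mutate(
    callNumber = row_number(),
    daysBetweenCalls = as.numeric(difftime(date, lag(date), units = "days"))
  ) %>%
  ungroup()

# Average gap per step: callNumber 2 is the gap between Call 1 and Call 2, etc.
avgByStep <- gaps %>%
  filter(!is.na(daysBetweenCalls)) %>%
  group_by(callNumber) %>%
  summarise(avgDays = mean(daysBetweenCalls), nPhones = n())

print(avgByStep)
```

Here `nPhones` counts how many phone numbers contribute to each average, which is worth checking since fewer numbers reach the later calls.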

Gopala
  • Thank you, user3949008, for your input. This helps a lot, and I've been able to clean up my whole script significantly. I think what makes my question go slightly deeper than what I've seen elsewhere is that I would also like to add a filter so that I can extract only the time between calls 1 and 2, then only the time between calls 2 and 3 for those numbers that have them, and so on. In my data, some numbers have a total of 8 calls. I'd like to pull out the times between each step. – el_dewey Jan 13 '16 at 22:51
  • The code above gives you the time between successive calls. If you want to filter out rows beyond a certain number of calls, you can do that using filter() or slice() in dplyr. – Gopala Jan 13 '16 at 23:25
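
As a rough illustration of the filter() suggestion in the comment above, the sketch below keeps only the gap between calls 2 and 3. It assumes a per-phone call counter built with row_number(); the sample data and the names `callNumber` and `gap2to3` are hypothetical and not part of the original answer.

```r
library(dplyr)

# Hypothetical sample data mirroring the example in the question
df <- data.frame(
  phone = c(rep("111-111-1111", 2), rep("222-222-2222", 4), rep("333-333-3333", 3)),
  date  = as.Date(sprintf("09/%02d/15", 1:9), format = "%m/%d/%y"),
  stringsAsFactors = FALSE
)

# Keep only each phone's third call, whose gap is the time between calls 2 and 3.
# Phones with fewer than three calls have no such row and drop out automatically.
gap2to3 <- df %>%
  arrange(phone, date) %>%
  group_by(phone) %>%
  mutate(
    callNumber = row_number(),
    daysBetweenCalls = as.numeric(difftime(date, lag(date), units = "days"))
  ) %>%
  filter(callNumber == 3) %>%
  ungroup()

print(gap2to3)
```

Changing the 3 in the filter to another call number selects a different step, for example 2 for the gap between calls 1 and 2.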