0

I have two dataframes in R: A.df and B.df. The first contains N rows where each row is an event that happened in a certain date and place.

The second is a list of individuals that have been interviewed in a certain date and place.

For each individual, I would like to count the number of events that happened within a certain timeframe before the interview date in the same location of the individual's place of interview.

Let's say that the time frame is x days before the date of interview, and that I have computed that date and stored in the variable xdaysbefore.

Here below how the data frames look like

A.df

#Event     Date     Place
   1   2015-05-01     1    
   2   2015-03-11     1
   3   2015-07-04     2
   4   2015-05-10     3

B.df

#Individual  Date of Interview   Place   xdaysbefore
       1         2016-07-11        1       2014-09-11  
       2         2016-05-07        3       2014-07-04
       3         2016-08-09        2       2014-03-22
       4         2016-01-10        3       2014-09-17

Note that Date, Date of Interview and xdaysbefore are all in Date R class

How can I count for each individual in B.df the events happened within the time frame Date of Interview - xdaysbefore according to the place in which the event has happened and the individual place of interview.

What I would expect in B.df would look like this:

B.df

#Individual  Date of Interview   Place   xdaysbefore      CountedEvents
       1         2016-07-11        1       2014-09-11         2
       2         2016-05-07        3       2014-07-04         1
       3         2016-08-09        2       2014-03-22         1
       4         2016-01-10        3       2014-09-17         1

where CountedEvents are the number of events happened in the time frame Date of Interview - xdaysbefore and in the same location where the individual i has been interviewed.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Caserio
  • 472
  • 1
  • 3
  • 14
  • Can you provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (something we can copy/paste into R)? – Roman Luštrik Oct 24 '16 at 08:27
  • So basically, you want to find for a place in `B.df` , number of events that occurred in `A.df` in the range of `xdaysbefore to Date Of interview` at the same place. – Ronak Shah Oct 24 '16 at 08:28

2 Answers2

1

You can use apply on every row of B.df.

Take a subset of A.df where places are equal. Check if the Date in A.df is within the range of Date_of_Interview and xdaysbefore

B.df$CountedEvents <- apply(B.df, 1, function(x) {
    temp = A.df[A.df$Place %in% x[3],]
    length(temp$Date < as.Date(x[2]) & temp$Date > as.Date(x[4]))
 })

B.df
#     Individual Date_of_Interview Place xdaysbefore CountedEvents
#1          1        2016-07-11     1      2014-09-11       2
#2          2        2016-05-07     3      2014-07-04       1
#3          3        2016-08-09     2      2014-03-22       1
#4          4        2016-01-10     3      2014-09-17       1

EDIT

If you want to access columns with names instead of indexes, you can use

apply(B.df, 1, function(x) {
        temp = A.df[A.df$Place %in% x["Place"],]
        length(temp$Date < as.Date(x["Date_of_Interview"]) & 
               temp$Date > as.Date(x["xdaysbefore"]))
})
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • 1
    Thank you, it works. What if I want to improve it in order to count only events of a certain type? Let's say that in A.df there is an additional column "EventType" which shows values 1,2,3,4. And I want to count only events of type 1. – Caserio Oct 24 '16 at 08:53
  • Just change `temp = A.df[A.df$Place %in% x[3],]` to `temp = A.df[A.df$Place %in% x[3] & A.df$EventType ==1`,] – Ronak Shah Oct 24 '16 at 08:56
  • Since I have many variables in my dataset I would like to invoke them by name. I think there is something wrong in the code below. It produce the variable CountedEvents with no variation just a single value for all rows, that value corresponds to the number of observations in B.df. Code: `B.df$CountedEvents <- apply(B.df, 1, function(x) { temp = A.df[A.df$Place %in% B.df$Place,] length(temp$Date < as.Date(B.df$date) & temp$Date > as.Date(B.df$xdaysbefore)) })` – Caserio Oct 24 '16 at 16:27
  • @HamidOskorouchi okay..I have updated the answer. Please check if that is what you want. – Ronak Shah Oct 25 '16 at 04:28
0

You can achieve that by using a combination of merge and aggregate:

# merge into a new dataset
AB <- merge(A, B, by = 'Place', all = TRUE)

# create a logical variable which indicates whether 'Date' falls within the range
AB$count <- AB$xdaysbefore < AB$Date & AB$Date_of_Interview > AB$Date

# aggregate into a count varaible
aggregate(count ~ Individual + Date_of_Interview + xdaysbefore, AB, sum)

which gives:

  Individual Date_of_Interview xdaysbefore count
1          3        2016-08-09  2014-03-22     1
2          2        2016-05-07  2014-07-04     1
3          1        2016-07-11  2014-09-11     2
4          4        2016-01-10  2014-09-17     1

Alternatively you could use the new non-equi join possibility from the development version of the data.table package:

library(data.table)

# convert the dataframes to data.table's (which are enhanced dataframes)
setDT(A)
setDT(B)

# join and count
A[B, on = .(Place, Date < Date_of_Interview, Date > xdaysbefore)
  ][, .(count = .N), .(Individual, Place, Date_of_Interview = Date, xdaysbefore = Date.1)]

which gives:

   Individual Place Date_of_Interview xdaysbefore count
1:          1     1        2016-07-11  2014-09-11     2
2:          2     3        2016-05-07  2014-07-04     1
3:          3     2        2016-08-09  2014-03-22     1
4:          4     3        2016-01-10  2014-09-17     1
Jaap
  • 81,064
  • 34
  • 182
  • 193