2

I have two data frames (DF1 and DF2):

(1) DF1 contains information on individual-level, i.e. on 10.000 individuals nested in 30 units across 11 years (2000-2011). It contains four variables:

  1. "individual" (numeric id for each individual; ranging from 1-10.000)
  2. "unit" (numeric id for each unit; ranging from 1-30)
  3. "date1" (a date in date format, i.e. 2000-01-01, etc; ranging from 2000-01-01 to 2010-12-31)
  4. "date2" ("Date1" + 1 year)

(2) DF2 contains information on unit-level, i.e. on the same 30 units as in DF1 across the same time period (2000-2011) and further contains a numeric variable ("x"):

  1. "unit" (numeric id for each unit; ranging from 1-30)
  2. "date" (a date in date format, i.e. 2000-01-01, etc; ranging from 2000-01-01 to 2011-12-31)
  3. "x" (a numeric variable, ranging from 0 to 200)

I would like to create new variable ("newvar") that gives me for each "individual" per "unit" the sum of "x" (DF2) counting from "date1" (DF1) to "date2" (DF2). This means that I would like to add this new variable to DF1.

For instance, if "individual"=1 in "unit"=1 has "date1"=2000-01-01 and "date2"=2001-01-01, and in DF2 "unit"=1 has three observations in the time period "date1" to "date2" (i.e. 2000-01-01 to 2001-01-01) with "x"=1, "x"=2 and "x"=3, then I would like add a new variable that gives for "individual"=1 in "unit"=1 "newvar"=6.

I assume that I need to use a for loop in R and have been using the following code:

for(i in length(DF1)){

DF1$newvar[i] <-sum(DF2$x[which(DF1$date == DF1$date1[i] &
                     DF1$date == DF1P$date1[i] &
                     DF2$unit == DF1P$unit[i]),])

}

but get the error message:

Error in DF2$x[which(DF2$date ==  : incorrect number of dimensions 

Any ideas of how to create this variable would be tremendously appreciated!

Here is a small example as well as the expected output, using one unit for the sake of simplicity:

Assume DF1 looks as follows:

individual  unit  date1        date2   
1           1     2000-01-01   2001-01-01
2           1     2000-02-02   2001-02-02
3           1     2000-03-03   2000-03-03
4           1     2000-04-04   2000-04-04
5           1     2000-12-31   2001-12-31 
(...)
996         1     2010-01-01   2011-01-01
997         1     2010-02-15   2011-02-15
998         1     2010-03-05   2011-03-05
999         1     2010-04-10   2011-04-10
1000        1     2010-12-27  2011-12-27
1001        2     2000-01-01   2001-01-01
1002        2     2000-02-02   2001-02-02
1003        2     2000-03-03   2000-03-03
1004        2     2000-04-04   2000-04-04
1005        2     2000-12-31   2001-12-31 
(...)
1996        2     2010-01-01   2011-01-01
1997        2     2010-02-15   2011-02-15
1998        2     2010-03-05   2011-03-05
1999        2     2010-04-10   2011-04-10
2000        2     2010-12-027  2011-12-27
(...)
3000        34    2000-02-02   2002-02-02
3001        34    2000-05-05   2001-05-05
3002        34    2000-06-06   2001-06-06
3003        34    2000-07-07   2001-07-07
3004        34    2000-11-11   2001-11-11
(...)
9996        34    2010-02-06   2011-02-06
9997        34    2010-05-05   2011-05-05
9998        34    2010-09-09   2011-09-09 
9999        34    2010-09-25   2011-09-25
10000       34    2010-10-15   2011-10-15

Assume DF2 looks as follows:

unit      date         x
1         2000-01-01   1
1         2000-05-01   2
1         2000-12-01   3
1         2001-01-02   10
1         2001-07-05   20
1         2001-12-31   30
(...) 
2         2010-05-05   1 
2         2010-07-01   1
2         2010-08-09   1
3         (...)

This is what I would like DF1 to look like after running the code:

individual  unit      date1        date2        newvar  
    1           1     2000-01-01   2001-01-01   6
    2           1     2000-02-02   2001-02-02   16
    3           1     2000-03-03   2001-03-03   15
    4           1     2000-04-04   2001-04-04   15
    5           1     2000-12-31   2001-12-31   60
    (...)
    996         1     2010-01-01   2011-01-01    3
    997         1     2010-02-15   2011-02-15    2
    998         1     2010-03-05   2011-03-05    2
    999         1     2010-04-10   2011-04-10    2
    1000        1     2010-12-27  2011-12-27     0
    (...)

However, I cannot simply aggregate: Imagine that in DF1 each "unit" has several hundreds of individuals for each year between 2000 and 2011. And DF2 has many observations for each unit across the years 2000-2011.

Gret-D
  • 23
  • 1
  • 5
  • 1
    Please show a small reproducible example and expected output – akrun Jun 03 '17 at 15:40
  • 1
    I didn't check the code, but the error message you get because you are treating a vector (`DF2$x`) as two dimensional, when it has only 1 dimension. You need to delete the comma before your last paranthesis: `,])` – Bea Jun 03 '17 at 15:47
  • Thank you @Bea ! The error message disappears when deleting the comma, so that is great. Unfortunately, the code does not produce correct sums, i.e. it gives the value 0 for all observations in "newvar" in DF1. – Gret-D Jun 03 '17 at 15:54
  • Thank you, @akrun! I added a small example with expected output - I hope this is what you were thinking of? I am new to this, and so I try my best to provide the information needed... – Gret-D Jun 03 '17 at 16:20

2 Answers2

3

We can use data.table

library(data.table)
setDT(DF1)
setDT(DF2)
DF1[DF2[, .(newvar = sum(x)), .(unit, individual = cumsum(date %in% DF1$date1))],
             newvar := newvar, on = .(individual, unit)]
DF1
#    individual unit      date1      date2 newvar
#1:          1    1 2000-01-01 2001-01-01      6
#2:          2    1 2001-01-02 2002-01-02     60

Or we can use a non-equi join

DF1[DF2[DF1, sum(x), on = .(unit, date >= date1, date <= date2),
        by = .EACHI], newvar := V1, on = .(unit, date1=date)]

DF1
#   individual unit      date1      date2 newvar
#1:          1    1 2000-01-01 2001-01-01      6
#2:          2    1 2001-01-02 2002-01-02     60
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thank you so much @akrun! The second option solved the issue! – Gret-D Jun 03 '17 at 17:41
  • I realized that the code is not exactly doing what I was thinking of. For each "individual" in DF1 it sums the values of "x" in DF2 from the first date in "date1" to the individual specific "date2" (cumulative from 2000-01-01 to whatever "date2" is), rather than from the individual specific "date1" to the individual specific "date2". Exp: if an individual has "date1"=2005-01-01 & "date2"=2006-01-01, the output gives the sum for "x" from "date1"=2000-01-01 to "date2"=2006-01-01, rather than correctly starting to sum at 2005-01-01. Any help on how to modify the code would be appreciated! – Gret-D Jun 04 '17 at 09:41
  • Yes, I have tried both and they give me the same result unfortunately... – Gret-D Jun 04 '17 at 10:02
  • @Gret-D I am using your example. Perhaps you can provide a new example tat gives a different output – akrun Jun 04 '17 at 10:03
  • Example is modified! I hope this helps. Basically, I want the code to sum "x" from the individual specific date1 to the individual specific date2. – Gret-D Jun 04 '17 at 10:56
  • @Gret-D Do you think the solution posted by Bea works for you. – akrun Jun 04 '17 at 10:58
  • I have tried Bea's solution for a subset of the data (only one unit), as it takes a very long time for the full set (I have 50.000 obs in the actual data frame). It produces the same result as your code, i.e. it starts to sum from the very first "date1" for that unit (rather than the individual-specific "date1", but stops correctly summing at the individual specific "date2". I am thinking hard, but I don't understand why it would do so... – Gret-D Jun 04 '17 at 11:22
  • 1
    I have found the bug - both your version and Bea's version (on a subset of data with 2 units) works now! The problem was that one of the dates was accidentaly 'mondate' format. My apologies for the confusions and thank you again! – Gret-D Jun 04 '17 at 11:38
2

You were almost there, I just modified slightly your for loop, and also made sure that the date variables are considered as such:

DF1$date1 = as.Date(DF1$date1,"%Y-%m-%d")
DF1$date2 = as.Date(DF1$date2,"%Y-%m-%d")
DF2$date = as.Date(DF2$date,"%Y-%m-%d")

for(i in 1:nrow(DF1)){
  DF1$newvar[i] <-sum(DF2$x[which(DF2$unit == DF1$unit[i] & 
                                  DF2$date>= DF1$date1[i] &
                                  DF2$date<= DF1$date2[i])]) 
}

The problem was, that you were asking DF2$date to be simultaneously == DF1$date1 & DF1$date2. And also, length(DF1) gives you the number of columns. To have the number of rows you can either use nrow(DF1), or dim(DF1)[1].

Bea
  • 1,110
  • 12
  • 20
  • 1
    Thank you so much for your answer @Bea! Unfortunately, the for loop takes a very long time to run, and I haven't had the patience to wait that long (since akrun's code worked). But just running the code without the for loop works without an error (does of course not correctly sum for units), so I think this would be a good solution if my data frames were smaller... – Gret-D Jun 03 '17 at 17:44
  • You are right, for loop can take much longer than data.table. It's good it worked out with akrun's code :) – Bea Jun 03 '17 at 17:47