I'm an applied researcher working primarily with nationwide registry data, and I'm making the transition from Stata to R. The dplyr package has made most of my daily data management tasks work smoothly. Nevertheless, I'm struggling to get R to generate new variables based on nested loops.
Suppose that we have the following dataset on six participants born in 1990-1992, with their grade point averages (GPA) measured annually in 2000-2004.
* Stata
clear all
input id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
1 1990 1.2 1.3 1.4 1.5 1.3
2 1990 2.3 2.5 2.2 2.1 2.6
3 1991 3.1 3.9 3.4 3.5 4.0
4 1991 2.6 3.1 2.4 1.9 3.1
5 1992 1.4 1.8 3.2 2.3 3.2
6 1992 3.5 4.0 4.0 4.0 3.9
end
list
     +--------------------------------------------------------------+
     | id   byear   gpa2000   gpa2001   gpa2002   gpa2003   gpa2004 |
     |--------------------------------------------------------------|
  1. |  1    1990       1.2       1.3       1.4       1.5       1.3 |
  2. |  2    1990       2.3       2.5       2.2       2.1       2.6 |
  3. |  3    1991       3.1       3.9       3.4       3.5         4 |
  4. |  4    1991       2.6       3.1       2.4       1.9       3.1 |
  5. |  5    1992       1.4       1.8       3.2       2.3       3.2 |
  6. |  6    1992       3.5         4         4         4       3.9 |
     +--------------------------------------------------------------+
Or equivalently in R:
df <- read.table(header = TRUE, text = "id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
1 1990 1.2 1.3 1.4 1.5 1.3
2 1990 2.3 2.5 2.2 2.1 2.6
3 1991 3.1 3.9 3.4 3.5 4.0
4 1991 2.6 3.1 2.4 1.9 3.1
5 1992 1.4 1.8 3.2 2.3 3.2
6 1992 3.5 4.0 4.0 4.0 3.9
")
I would now like to generate three new variables that hold each participant's GPA at ages 10, 11, and 12 (gpa_age10 ... gpa_age12), where age is the calendar year minus the birth year.
In Stata, I would normally do this by way of nested forval loops:
forval i = 10/12 {
    gen gpa_age`i' = .
    forval j = 1990/1992 {
        replace gpa_age`i' = gpa`=`j'+`i'' if byear == `j'
    }
}
This would result in the following dataset:
     +-----------------------------------------------------------------------------------------------+
     | id   byear   gpa2000   gpa2001   gpa2002   gpa2003   gpa2004   gpa_a~10   gpa_a~11   gpa_a~12 |
     |-----------------------------------------------------------------------------------------------|
  1. |  1    1990       1.2       1.3       1.4       1.5       1.3        1.2        1.3        1.4 |
  2. |  2    1990       2.3       2.5       2.2       2.1       2.6        2.3        2.5        2.2 |
  3. |  3    1991       3.1       3.9       3.4       3.5         4        3.9        3.4        3.5 |
  4. |  4    1991       2.6       3.1       2.4       1.9       3.1        3.1        2.4        1.9 |
  5. |  5    1992       1.4       1.8       3.2       2.3       3.2        3.2        2.3        3.2 |
  6. |  6    1992       3.5         4         4         4       3.9          4          4        3.9 |
     +-----------------------------------------------------------------------------------------------+
I understand that there might not be a direct translation of this Stata code to R, but what is the best way to replicate these results in R?
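As a possible starting point, the Stata loops can be mirrored almost line for line in base R. The following is a minimal sketch rather than a definitive or idiomatic dplyr solution; the names `age`, `year`, `new_var`, and `rows` are illustrative choices, and it assumes `byear` has no missing values and that a `gpa<year>` column exists for every birth year and age combination.

```r
# Example data from the question
df <- read.table(header = TRUE, text = "id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
1 1990 1.2 1.3 1.4 1.5 1.3
2 1990 2.3 2.5 2.2 2.1 2.6
3 1991 3.1 3.9 3.4 3.5 4.0
4 1991 2.6 3.1 2.4 1.9 3.1
5 1992 1.4 1.8 3.2 2.3 3.2
6 1992 3.5 4.0 4.0 4.0 3.9
")

for (age in 10:12) {
  # Counterpart of: gen gpa_age`i' = .
  new_var <- paste0("gpa_age", age)
  df[[new_var]] <- NA_real_

  for (year in 1990:1992) {
    # Counterpart of: replace ... if byear == `j'
    rows <- df$byear == year
    # The source column name is built from birth year + age, e.g. "gpa2001"
    df[rows, new_var] <- df[rows, paste0("gpa", year + age)]
  }
}

print(df)
```

The inner loop only selects which source column to copy for each birth cohort, so a reshape to long format (one row per participant and year) followed by computing `year - byear` would likely be the more conventional dplyr/tidyr route.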