4

I'm an applied researcher, working primarily with nationwide registry data, who is making the transition from Stata to R. The dplyr package has made most of my daily data management tasks work smoothly. Nevertheless, I'm currently struggling with getting R to generate new variables based on nested loops.

Suppose that we have the following dataset on six participants born between 1990-1992 with measures on their grade point averages between 2001-2004.

* Stata
clear all
input id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
1 1990 1.2 1.3 1.4 1.5 1.3
2 1990 2.3 2.5 2.2 2.1 2.6
3 1991 3.1 3.9 3.4 3.5 4.0
4 1991 2.6 3.1 2.4 1.9 3.1
5 1992 1.4 1.8 3.2 2.3 3.2
6 1992 3.5 4.0 4.0 4.0 3.9
end
list

     +--------------------------------------------------------------+
     | id   byear   gpa2000   gpa2001   gpa2002   gpa2003   gpa2004 |
     |--------------------------------------------------------------|
  1. |  1    1990       1.2       1.3       1.4       1.5       1.3 |
  2. |  2    1990       2.3       2.5       2.2       2.1       2.6 |
  3. |  3    1991       3.1       3.9       3.4       3.5         4 |
  4. |  4    1991       2.6       3.1       2.4       1.9       3.1 |
  5. |  5    1992       1.4       1.8       3.2       2.3       3.2 |
  6. |  6    1992       3.5         4         4         4       3.9 |
     +--------------------------------------------------------------+

Or equivalently in R:

df <- read.table(header=T, text="id byear gpa2000 gpa2001 gpa2002 gpa2003 gpa2004
1 1990 1.2 1.3 1.4 1.5 1.3
2 1990 2.3 2.5 2.2 2.1 2.6
3 1991 3.1 3.9 3.4 3.5 4.0
4 1991 2.6 3.1 2.4 1.9 3.1
5 1992 1.4 1.8 3.2 2.3 3.2
6 1992 3.5 4.0 4.0 4.0 3.9
") 

I would now like to generate three new variables that measure each participant's GPA between ages 10-12 years (gpa_age10 ... gpa_age12).

In Stata, I would normally do this by the way of nested for loops:

forval i = 10/12 {
    gen gpa_age`i' = .
    forval j = 1990/1992 {
        replace gpa_age`i' = gpa`=`j'+`i'' if byear == `j'
    }
}

This would result in the following dataset:

     +-----------------------------------------------------------------------------------------------+
     | id   byear   gpa2000   gpa2001   gpa2002   gpa2003   gpa2004   gpa_a~10   gpa_a~11   gpa_a~12 |
     |-----------------------------------------------------------------------------------------------|
  1. |  1    1990       1.2       1.3       1.4       1.5       1.3        1.2        1.3        1.4 |
  2. |  2    1990       2.3       2.5       2.2       2.1       2.6        2.3        2.5        2.2 |
  3. |  3    1991       3.1       3.9       3.4       3.5         4        3.9        3.4        3.5 |
  4. |  4    1991       2.6       3.1       2.4       1.9       3.1        3.1        2.4        1.9 |
  5. |  5    1992       1.4       1.8       3.2       2.3       3.2        3.2        2.3        3.2 |
  6. |  6    1992       3.5         4         4         4       3.9          4          4        3.9 |
     +-----------------------------------------------------------------------------------------------+

I understand that there might not be a direct translation of this Stata code to R but what is the best way of replicating these results in R?

user3102806
  • 141
  • 6
  • 1
    Your Stata data set is nicely reproducible (with `input..end`), but since you want an answer with R, you should probably do the same for it. Here's a guide: http://stackoverflow.com/a/28481250/1191259 In my opinion, the best way to do this is to store your data in long format (id, byear, year, gpa), since var name parsing is an error-prone approach to coding and entirely unnecessary in R. If you do that, you could just add an age column = year - byear. – Frank Aug 04 '15 at 00:32

2 Answers2

3

You could reshape your data.frame to a form where each row represents a year for a student using the reshape2 package. Then calculating the age becomes trivial. Here is the complete code to accomplish this task assuming your data.frame from above is in a variable called dat:

mdat <- melt(dat, id.vars=c('id', 'byear'), value.name='gpa')
mdat %>%
    mutate(year=as.numeric(gsub('gpa', '', variable))) %>%
    select(id, byear, year, gpa) %>%
    mutate(age=year-byear)

Additionally, you can get the data.frame you requested by casting the melted data.frame:

dcast(mdat, id + byear ~ age, value.var='gpa')
> id byear  8    9    10   11   12   13   14
> 1  1990   NA   NA   1.2  1.3  1.4  1.5  1.3
> 2  1990   NA   NA   2.3  2.5  2.2  2.1  2.6
> 3  1991   NA   3.1  3.9  3.4  3.5  4.0   NA
> 4  1991   NA   2.6  3.1  2.4  1.9  3.1   NA
> 5  1992   1.4  1.8  3.2  2.3  3.2   NA   NA
> 6  1992   3.5  4.0  4.0  4.0  3.9   NA   NA
cr1msonB1ade
  • 1,716
  • 9
  • 14
  • 1
    Thank you (and Frank) very much for your nice suggestion. My reservation against reshaping the dataset stems from the fact that most of my projects involve millions of individuals that are followed over decades. The reshaping of datasets would take a very long time, I'm assuming, and it might be better to consider alternative approaches. – user3102806 Aug 04 '15 at 00:56
  • Assuming as in your example you would like to calculate these values for possibly missing values melting avoids including `NA` values which could vastly outweigh the cost of reshaping for very large data sets. Also if your data is already a data.frame, reshaping should be relatively inexpensive computationally. The `as.numeric(gsub` call is likely to be your computational bottleneck in large data sets. – cr1msonB1ade Aug 04 '15 at 01:01
2

I know the question has been excellently handled by the @cr1msonB1ade but to show the OP a nested for loop version in R to match posted Stata code:

for (i in 10:12) {
  for (j in 1990:1992) {
    gpadf[[paste0("gpa_age", i)]][gpadf$byear==j] <- 
              gpadf[[paste0("gpa", j+i)]][gpadf$byear==j]   
  }
}
Parfait
  • 104,375
  • 17
  • 94
  • 125