EDIT: I improved the question by adding a reproducible example and clarifying my issues.
Hi, I need to translate the following Stata code to R so that it can be run on a large dataset:
sort UF UPA Ano Trimestre
loc j = 1
loc stop = 0
loc count = 0
while `stop' == 0 {
    loc lastcount = `count'
    count if p201 == . & n_p == `i'+1
    loc count = r(N)
    if `count' == `lastcount' {
        loc stop = 1
    }
    else {
        if r(N) != 0 {
            replace p201 = p201[_n - `j'] if ///
                UF == UF[_n - `j'] & ///
                UPA == UPA[_n - `j'] & ///
                n_p == `i'+1 & n_p[_n - `j'] == `i' & ///
                p201 ==. & forw[_n - `j'] != 1 &
            replace forw = 1 if UF == UF[_n + `j'] & ///
                UPA == UPA[_n + `j'] & ///
                p201 == p201[_n + `j'] & ///
                n_p == `i' & n_p[_n + `j']==`i'+ 1 & ///
                forw != 1
            loc j = `j' + 1
        }
        else {
            loc stop = 1
        }
    }
}
replace back = p201 !=. if n_p == `i'+1
replace forw = 0 if forw != 1 & n_p == `i'
}
My dataset is huge and more complex than the example posted below. Mainly, I would like to understand what purpose the while loop involving j serves.
Here is a toy example and the desired result in R:
start <- data.frame(
Ano = c(2012, 2012, 2012, 2012),
Trimestre = c("1", "2", "3", "4"),
UF = c(28, 28, 28, 28),
UPA = c(280020150, 280020150, 280020150, 280020150),
n_p = c(1, 2, 3, 4),
p201 = c(1, NA, NA, NA),
back = c(NA, NA, NA, NA),
forw = c(NA, NA, NA, NA)
)
end <- data.frame(
Ano = c(2012, 2012, 2012, 2012),
Trimestre = c("1", "2", "3", "4"),
UF = c(28, 28, 28, 28),
UPA = c(280020150, 280020150, 280020150, 280020150),
n_p = c(1, 2, 3, 4),
p201 = c(1, 1, 1, 1),
back = c(NA, 1, 1, 1),
forw = c(1, 1, 1, 0)
)
In the dataset, each combination of UF and UPA identifies an individual, and there are many such combinations. Ano and Trimestre denote the year and the trimester.
It seems as if the code is only matching all rows with the same UF-UPA by filling them all with the first value of p201 in each group. The variables back and forw equal 1 if an observation is paired with another one at a past or future date, respectively.
My question, then, is whether someone can help me understand what the while loop and j are for. I am not sure whether the code could be greatly simplified in R by only using group_by from dplyr, or whether a for loop would even be required.
However, I am not sure whether it only looks this way because of the particular subset of the data I have posted here, or whether these parts are indeed necessary. Is there a clever way to find out, for example by testing other cases?
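To make the simplification I have in mind concrete, here is a minimal sketch using dplyr. It rests on assumptions of my own that I have only checked against the toy example, not against the Stata loop on other cases: every row in a UF-UPA group takes the group's first p201, back is only defined after the first n_p in the group, and forw is 0 only for the last n_p. The name attempt is just a placeholder.

```r
library(dplyr)

# Toy input and desired output, copied from the example above
start <- data.frame(
  Ano = c(2012, 2012, 2012, 2012),
  Trimestre = c("1", "2", "3", "4"),
  UF = c(28, 28, 28, 28),
  UPA = c(280020150, 280020150, 280020150, 280020150),
  n_p = c(1, 2, 3, 4),
  p201 = c(1, NA, NA, NA),
  back = c(NA, NA, NA, NA),
  forw = c(NA, NA, NA, NA)
)
end <- data.frame(
  Ano = c(2012, 2012, 2012, 2012),
  Trimestre = c("1", "2", "3", "4"),
  UF = c(28, 28, 28, 28),
  UPA = c(280020150, 280020150, 280020150, 280020150),
  n_p = c(1, 2, 3, 4),
  p201 = c(1, 1, 1, 1),
  back = c(NA, 1, 1, 1),
  forw = c(1, 1, 1, 0)
)

attempt <- start %>%
  arrange(UF, UPA, Ano, Trimestre) %>%   # same ordering as the Stata sort
  group_by(UF, UPA) %>%
  mutate(
    # ASSUMPTION: the group's first p201 is carried to every row
    p201 = first(p201),
    # ASSUMPTION: back is undefined for the first n_p, otherwise 1 if p201 is filled
    back = ifelse(n_p == min(n_p), NA, as.numeric(!is.na(p201))),
    # ASSUMPTION: forw is 1 for every row except the last n_p in the group
    forw = as.numeric(n_p < max(n_p))
  ) %>%
  ungroup()

# Compare the sketch with the desired result
all.equal(as.data.frame(attempt), end)
```

If this comparison only holds on the toy data, my guess is that the while loop and j matter for cases the example does not cover, such as groups with gaps in n_p or with a missing first p201, but I would need someone to confirm that.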