I'm working with a panel data set with the following variables. Here's a snippet of my data:
i region urban year
8431 3 1 1979
8431 3 1 1980
8431 3 1 1981
8431 3 1 1982
8431 3 0 1983
8431 3 0 1984
8431 3 0 1985
8431 3 0 1986
8431 3 0 1987
8431 3 0 1988
8431 3 0 1989
8431 3 1 1990
8431 3 1 1991
8431 3 1 1992
8431 3 1 1993
8431 3 1 1994
8431 3 1 1996
8431 3 0 1998
8431 3 1 2000
8431 3 1 2002
8431 3 1 2004
8431 3 1 2006
8431 2 0 2008
8431 3 1 2010
8431 3 1 2012
The panel is strongly balanced, but there are some missing observations in the data that I deal with through interpolation.
I want to calculate the total number of times region
and urban
have changed during the observation period for each individual.
First thing I tried was lagged values:
bysort i: gen urban_lag = l1.urban
bysort i (year): gen urbanchange = 0 if urban==urban_lag & !missing(urban)
replace urbanchange = 1 if urban!=urban_lag & !missing(urban)
For some reason, this generates a lot of missing values, even for individuals where there are not missing values for region
. Does anyone know why this is the case?
I then tried this:
sort i year
by i: gen byte urbanchange = urban != urban[_n-1]
This mostly works and gets me slightly closer to where I want, except the first observation of each individual is equal to 1, because change[0]
is before the start of the data and so Stata returns it as missing, so I get urbanchange[1] = 1
for every individual.
This is the output I want:
i region urban year regionchange urbanchange
8431 3 1 1979 0 0
8431 3 1 1980 0 0
8431 3 1 1981 0 0
8431 3 1 1982 0 0
8431 3 0 1983 0 1
8431 3 0 1984 0 0
8431 3 0 1985 0 0
8431 3 0 1986 0 0
8431 3 0 1987 0 0
8431 3 0 1988 0 0
8431 3 0 1989 0 0
8431 3 1 1990 0 1
8431 3 1 1991 0 0
8431 3 1 1992 0 0
8431 3 1 1993 0 0
8431 3 1 1994 0 0
8431 3 1 1996 0 0
8431 3 0 1998 0 1
8431 3 1 2000 0 1
8431 3 1 2002 0 0
8431 3 1 2004 0 0
8431 3 1 2006 0 0
8431 2 0 2008 1 1
8431 3 1 2010 1 1
8431 3 1 2012 0 0
This question is basically the Stata version of this question I asked a year ago Counting the number of changes of a categorical variable during repeated measurements within a category
Edit regarding interpolation: since there are a lot of gaps in my data, for observations with missing values, I make the assumption that, for example, the individual's current urban value is the same as the last observed (non-missing) urban value.
For example:
region urban year
1 1 1979
1 1 1980
1 1 1981
3 1 1982
3 1 1983
3 1 1984
. . 1985
1 1 1986
1 1 1987
1 1 1988
1 1 1989
. . 1990
. . 1991
. . 1992
. . 1993
. . 1994
I fill in the missing values like such:
gen urban1 = urban
bysort i (year): replace urban1 = urban1[_n-1] if missing(urban1)
gsort i -year
replace urban1 = urban1[_n-1] if urban1 >=.
and do the same for region.
Output:
region urban year region1 urban1
1 1 1979 1 1
1 1 1980 1 1
1 1 1981 1 1
3 1 1982 3 1
3 1 1983 3 1
3 1 1984 3 1
. . 1985 3 1
1 1 1986 1 1
1 1 1987 1 1
1 1 1988 1 1
1 1 1989 1 1
. . 1990 1 1
. . 1991 1 1
. . 1992 1 1
. . 1993 1 1
. . 1994 1 1
I'm making some strong assumptions here and some may disagree with what I'm doing here, but for the purpose of counting the number of changes, it doesn't affect the results I want and the missing values don't mess with my output.