I have the following dataset (sample). Obviously, the real data set that I am using is much larger:
gvkey tic stko year
001689 AEP1 1 2011
017096 BA3 1 2011
001440 AEP 0 2011
002285 BA 0 2011
001689 AEP1 1 2012
017096 BA3 1 2012
001440 AEP 0 2012
002285 BA 0 2012
Here is the code to generate the data:
dat <- data.frame(gvkey=c("001689", "017096", "001440", "002285"), tic=c("AEP1", "BA3", "AEP", "BA"), stko=c(1, 1, 0, 0), year=c(2011,2011,2011,2011,2012,2012,2012,2012))
Here is what I would like to do: Each row represents a year-firm pair with tic
being the firm ticker. Firms where stko
equals 1
are subsidiaries and share the same tic
as their parent company, but have a number attached to the ticker, e.g. AEP1
belongs to AEP
. Basically, I would like to create a new variable parent
, which indicates for each subsidiary (row with stko=1
) the gvkey
of the parent company. And I would like to do that for each year. The final data set should look like this:
gvkey tic stko year parent
001689 AEP1 1 2011 001440
017096 BA3 1 2011 002285
001440 AEP 0 2011
002285 BA 0 2011
001689 AEP1 1 2012 001440
017096 BA3 1 2012 002285
001440 AEP 0 2012
002285 BA 0 2012
Now, my initial approach would be to write a couple of for
loops that for a given year iterate over the rows. Whenever, stko=1
, then extract the part of the ticker without the number at the end (e.g. for row one AEP
) and find the row with this exact ticker in the given year (e.g. row 3 for year 2011) and copy the gvkey
of that row to the initial observation with stko=1
.
However, this procedure will be really slow given the size of my data set. I would be grateful if anyone can think of a faster and easier approach.
Many thanks!!
Using my main data set, the output for dput(droplevels(head(dat)))
is:
structure(list(gvkey = c("176017", "128663", "61586", "278120",
"14062", "285313"), datadate = structure(c(4L, 4L, 1L, 3L, 2L,
1L), .Label = c("31dec2010", "31dec2011", "31dec2012", "31dec2013"
), class = "factor"), fyear = c(2013, 2013, 2010, 2012, 2011,
2010), indfmt = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "INDL", class = "factor"),
consol = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "C", class = "factor"),
popsrc = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "D", class = "factor"),
datafmt = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "STD", class = "factor"),
tic = c("ATHX", "SQNM", "IMH", "FNLIF", "CCDBF", "BSBR"),
cusip = structure(c(1L, 6L, 5L, 4L, 3L, 2L), .Label = c("04744L106",
"05967A107", "124900309", "33564P103", "45254P508", "817337405"
), class = "factor"), conm = structure(c(1L, 6L, 5L, 4L,
3L, 2L), .Label = c("ATHERSYS INC", "BANCO SANTANDER BRASIL -ADR",
"CCL INDUSTRIES -CL B", "FIRST NATIONAL FINL CORP", "IMPAC MORTGAGE HOLDINGS INC",
"SEQUENOM INC"), class = "factor"), curcd = structure(c(2L,
2L, 2L, 1L, 1L, 2L), .Label = c("CAD", "USD"), class = "factor"),
costat = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"),
stko = c(0, 0, 0, 0, 0, 0)), .Names = c("gvkey", "datadate",
"fyear", "indfmt", "consol", "popsrc", "datafmt", "tic", "cusip",
"conm", "curcd", "costat", "stko"), row.names = c(NA, 6L), class = "data.frame")