I have a dataset consisting of surnames, years, and values y. My aim is to analyze whether a surname's value y depends on its value y in the previous generation. Unfortunately, I do not have a value y for every surname in every generation.
As an example dataset, you can take the following:
set.seed(700)
df_1 <- data.frame(
  year = rep(c(1700, 1730, 1760, 1790, 1820), each = 4),
  generation = rep(c(1, 2, 3, 4, 5), each = 4),
  surname = c("Miller", "NA", "Smith", "Garcia",
              "Miller", "Jordan", "Smith", "Garcia",
              "Miller", "Jordan", "NA", "Garcia",
              "Miller", "Jordan", "Smith", "NA",
              "NA", "Jordan", "Smith", "Garcia"),
  y = runif(20)
)
I run the following regression:
library(dplyr)  # provides %>%, group_by(), do() and lag() with order_by

fitted_models <- df_1 %>%
  group_by(surname) %>%
  do(model = lm(y ~ lag(y, n = 1, order_by = year), data = df_1))
Now, I have three related questions:
(1) How can I account for effects that are not group-specific (such as generation-specific fixed effects)?
(2) How should I treat the NA values?
(3) Does this regression use every observation together with the corresponding observation from the previous generation, or only the comparison between the first and the second generation?
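To make question (3) concrete, here is a minimal sketch of how the pairing with the previous generation could be built explicitly, so that it is visible which observations enter a regression. It rests on several assumptions that are not part of the setup above: the string "NA" marks an unknown surname and is dropped, a pair only counts when the previous observation of a surname is exactly one generation earlier, and the column names `prev_gen` and `y_prev` are hypothetical. The pooled model with `factor(generation)` is only one possible way to include generation effects, not a recommendation.

```r
library(dplyr)

set.seed(700)
df_1 <- data.frame(
  year = rep(c(1700, 1730, 1760, 1790, 1820), each = 4),
  generation = rep(c(1, 2, 3, 4, 5), each = 4),
  surname = c("Miller", "NA", "Smith", "Garcia",
              "Miller", "Jordan", "Smith", "Garcia",
              "Miller", "Jordan", "NA", "Garcia",
              "Miller", "Jordan", "Smith", "NA",
              "NA", "Jordan", "Smith", "Garcia"),
  y = runif(20)
)

df_lagged <- df_1 %>%
  filter(surname != "NA") %>%             # assumption: "NA" means unknown surname
  group_by(surname) %>%
  arrange(year, .by_group = TRUE) %>%     # order each surname by time
  mutate(
    prev_gen = lag(generation),           # generation of the previous observed row
    # keep the lagged y only if it really comes from the preceding generation
    y_prev = if_else(generation - prev_gen == 1, lag(y), NA_real_)
  ) %>%
  ungroup()

# Pooled regression; lm() drops rows with a missing y_prev by default
fit <- lm(y ~ y_prev + factor(generation), data = df_lagged)
nobs(fit)  # number of parent-child pairs actually used
```

Inspecting `df_lagged` shows which rows have no usable predecessor, for example the first observation of each surname or an observation that follows a gap.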