I apologize if my question does not make sense, as I am unsure how to phrase it.
I have two dataframes. One has a county identifier with corresponding population for each county. The pseudo-data I provide is fixed in time, however the actual data shows different county population values for each month over several years.
The second data frame lists the neighboring county for each county. Some counties have multiple neighbors while others have none. The goal here is to find the average population of neighboring counties for each county.
I am struggling conceptually on how to manipulate the data correctly. I attempted using group_by time and county with summarize, however I get the same mean neighboring population result for each county irrespective of year/month.
The columns "neighbor_county_agg" and "mean_neighbor_county" are my desired output. I don't know if I need to use a for-loop here since some counties have multiple neighbors thus the "average population" cell may builds on itself with new neighbor matches. Moreover, the population varies by month and year.
#pseudo data
dput(head(df1))
#note neighbor_county_agg and mean_neighbor_county are my desired output
structure(list(county_ID = c("A", "B", "C", "D", "E", "F"), population = c(100,
350, 200, 100, 50, 80), neighbor_county_agg = c("D, B", "A, F",
"NA", "A, F", "NA", "D, B, G"), mean_neighbor_county = c("100 + 350 / 2",
"100+80 / 2", "NA", "100 + 80 / 2", "NA", "100 + 350 + 50 /3"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
dput(head(df2))
structure(list(county_ID = c("A", "A", "B", "B", "C", "D"), neighbor_county = c("D",
"B", "A", "F", "NA", "A")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
#my attempt
df1 <- df1 %>% mutate(neighbor_population = df1$population[df1$neighbor_county_agg, df2$neighbor_county)])
#note: the pseudo data in my example does not have date
group_by(date, county_ID) %>%
summarise(mean_population=mean(neighbor_population),
.groups = 'drop')