0

I apologize if my question does not make sense, as I am unsure how to phrase it.

I have two dataframes. One has a county identifier with corresponding population for each county. The pseudo-data I provide is fixed in time, however the actual data shows different county population values for each month over several years.

The second data frame lists the neighboring county for each county. Some counties have multiple neighbors while others have none. The goal here is to find the average population of neighboring counties for each county.

I am struggling conceptually on how to manipulate the data correctly. I attempted using group_by time and county with summarize, however I get the same mean neighboring population result for each county irrespective of year/month.

The columns "neighbor_county_agg" and "mean_neighbor_county" are my desired output. I don't know if I need to use a for-loop here since some counties have multiple neighbors thus the "average population" cell may builds on itself with new neighbor matches. Moreover, the population varies by month and year.

#pseudo data

dput(head(df1))

#note neighbor_county_agg and mean_neighbor_county are my desired output

structure(list(county_ID = c("A", "B", "C", "D", "E", "F"), population = c(100, 
350, 200, 100, 50, 80), neighbor_county_agg = c("D, B", "A, F", 
"NA", "A, F", "NA", "D, B, G"), mean_neighbor_county = c("100 + 350 / 2", 
"100+80 / 2", "NA", "100 + 80 / 2", "NA", "100 + 350 + 50 /3"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

dput(head(df2))

structure(list(county_ID = c("A", "A", "B", "B", "C", "D"), neighbor_county = c("D", 
"B", "A", "F", "NA", "A")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))


#my attempt
df1 <- df1 %>% mutate(neighbor_population = df1$population[df1$neighbor_county_agg, df2$neighbor_county)]) 

#note: the pseudo data in my example does not have date
group_by(date, county_ID) %>% 
  summarise(mean_population=mean(neighbor_population),
            .groups = 'drop')
  • 2
    Welcome to SO, notyouraveragecat! Please do not post (only) an image of code/data/errors: it breaks screen-readers and it cannot be copied or searched (ref: https://meta.stackoverflow.com/a/285557 and https://xkcd.com/2116/). Please include the code, console output, or data (e.g., `data.frame(...)` or the output from `dput(head(x))`) directly. – r2evans Feb 13 '23 at 17:57
  • 1
    I suspect this can be trivially resolved by merging/joining (https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/q/5706437/3358272) the two frames, then sum-by-group (https://stackoverflow.com/q/11562656/3358272) to get the sum of neighboring countries. – r2evans Feb 13 '23 at 17:57
  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 13 '23 at 18:01

1 Answers1

0

Another method:

library(dplyr)
library(tidyr) # unnest, only for data creation

Sample data:

df1 <- data.frame(county=LETTERS[1:7], popn = c(100,350,200,100,50,80,50))
df2 <- data.frame(county=LETTERS[1:7], neighbor = c("D,B","A,F",NA,"A,F",NA,"D,B,G","F")) %>%
  mutate(neighbor = strsplit(neighbor, ",")) %>%
  unnest(neighbor)

The summarization:

df1 %>%
  left_join(df2, by = "county") %>%
  left_join(df1, by = c(neighbor = "county")) %>%
  group_by(county) %>%
  summarize(
    popn = first(popn.x),
    popn_neighbors_avg = mean(popn.y)
  )
# # A tibble: 7 × 3
#   county  popn popn_neighbors_avg
#   <chr>  <dbl>              <dbl>
# 1 A        100               225 
# 2 B        350                90 
# 3 C        200                NA 
# 4 D        100                90 
# 5 E         50                NA 
# 6 F         80               167.
# 7 G         50                80 
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • thank you so much r2evans! I actually implemented the first version of your answer and it worked. Apologies for not being clear, the original data does not come with the neighbor_county_agg column (that was included to show the output I would want). As a follow up, and perhaps I should add this to my original question, how can I approach this when working with multiple time periods? For example, the population for each county may differ by each month over the course of several years. Should I group_by year and month too here? – notyouraveragecat Feb 13 '23 at 19:41
  • Do you need me to reinstate (rollback) the previous version of the answer? That might be useful so that you can accept the version you prefer. – r2evans Feb 13 '23 at 19:41
  • Yes please. The previous version works better. – notyouraveragecat Feb 13 '23 at 19:53
  • I think that's the version you preferred, glad it worked for you. – r2evans Feb 13 '23 at 20:14