0

I have panel data where I split the whole data set into multiple data frames by year and match unique keys across years. For example, if you have 6,000 observations in 2000 and 7000 observations in, I'm trying to match the overlap between each year for every year from 2000 to 2017.

I have a brute forced solution that's about 350 lines of copy and pasted code, but I'm looking for a more efficient and elegant solution using loops.

I'm working with for loops and looking into map() functions at the moment, but I haven't found a solution. I'm using R4DS.

#1989
b1989 <- b %>% filter(year == 1989) %>% select(key, V7, z9, z11, z13, z15)
a1990 <- a %>% select(key,year) %>% filter(year == 1990) %>% distinct()
br1989 <- inner_join(b1989, a1990, by = "key")

#1990
b1990 <- b %>% filter(year == 1990) %>% select(key, V7, z9, z11, z13, z15)
a1991 <- a %>% select(key,year) %>% filter(year == 1991) %>% distinct()
br1990 <- inner_join(b1990, a1991, by = "key")

#1991
b1991 <- b %>% filter(year == 1991) %>% select(key, V7, z9, z11, z13, z15)
a1992 <- a %>% select(key,year) %>% filter(year == 1992) %>% distinct()
br1991 <- inner_join(b1991, a1992, by = "key")

busrescount_t1 <- c(nrow(br1989),nrow(br1990),nrow(br1991))

busrescount_t1
[1] 4366 4956 4768

It currently works, but is simply bad code and cumbersome. Also, doing it at scale for 2-year, 3-year, 4-year differences in a nightmare and will be 1000+ lines of copy/pasted code.

The goal is to have a loop that produces a vector of these matches that can be placed into a data frame. I'm trying to do this for 20+ years.

  • 2
    Opyth43, welcome to SO! It's difficult to help without knowing *something* (well, many things) about the actual data. Please make this question *reproducible* by including sample *unambiguous* data (e.g., edit your question and add `data.frame(x=...,y=...)` or the output from `dput(head(x))`), and expected output (based on the sample input). Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Jul 19 '19 at 15:26
  • Further, if all of your by-year frames are identically structured, instead of separating them, I suggest you either (1) keep them in one frame, using `dplyr::group_by` to perform calculations on them, or (2) at least keep them in a `list`-of-frames, where what you do to one frame you can easily do to all of them by using `lapply(list_of_frames, function(dfrm) ...)`, http://stackoverflow.com/a/24376207/3358272. – r2evans Jul 19 '19 at 15:27

1 Answers1

0

How about something like this? (I'd love to be able to verify this works using a sample of your data.)

In theory, we should be able to join b to a version of a where the year is shifted forward one. If the row in b has a match in a with the same key and the following year, the join should complete and have a TRUE in the a_match column.

b %>% 
  select(key, V7, z9, z11, z13, z15) %>%
  left_join(a %>% select(key, year) %>% 
               mutate(year = year + 1, a_match = TRUE),
            by = c("key", "year")) %>%
  filter(!is.na(a_match))
Jon Spring
  • 55,165
  • 4
  • 35
  • 53