Is there a command to merge data sets based on observations of a variable?

Question

I'm trying to merge two data sets, depending on the observations of the data sets.

In other words, I have two data sets both containing year and state. However, the two data sets consists each of one additional variable, X1 for df1 and X2 for df2. That is, I'm trying to merge the two dataset if a state is observed to a have minimum of 5 observation for both X1 and X2, such that the all observations for that state is included, even when both X1 and X2 are NA values.

Is there a way to merge the data sets to only merge states in which both X1 and X2 has minimum 5 observations? Such that, the new dataset has observations for all the years of the states that both X1 and X5 has a minimum of 5 observations in. While rest are excluded.

I have tried to use experiment with inner_join(df1,df2) with no success, as it only merges the year and state that both specific dataset has individual observations.

An reproducible example of the merge effect (for simplicity, I have used if 2 observations are non NA, the state is included)

df1 = read.table(
  text =
    "State Year X1
A 1 NA 
A 2 NA 
A 3 5 
A 4 NA 
B 1 NA 
B 2 NA 
B 3 4 
B 4 3", header = TRUE)

df2 = read.table(
  text =
    "State Year X2
A 1 NA 
A 2 5 
A 3 7 
A 4 NA 
B 1 NA 
B 2 2 
B 3 5 
B 4 7", header = TRUE)

newdf = read.table(
  text =
    "State Year X1 X2
B 1 NA NA 
B 2 NA 2
B 3 4 5
B 4 3 7", header = TRUE)

Here, newdf neglect the state A as the df1 only have one observation for that state, while all years are included for the state B (even the first year when both X1 and X2 are NA) as both X1 and X2 has minimum of 2 non-NA observations for that state. (recall, for simplicity here the minimum observation is 2 not 5)

Please share a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a small example of your data used. — Martin Gal, Dec 28 '22 at 15:02
I will try to make a reproducible example and upload it to the original question, but it might be difficult to visualize. — Jens, Dec 28 '22 at 15:04
Updated the original text to include an example, if anything is still unclear, please dont hesitate to ask. — Jens, Dec 28 '22 at 15:10
Perhaps you need `dplyr`s `full_join` instead of an `inner_join`? `full_join(df1, df2, by = c("State", "Year"))`. — Martin Gal, Dec 28 '22 at 15:23
I've tried the full_join command aswell but there is no sortation to the minimum observations. The full_join merges all the observation for all years. — Jens, Dec 28 '22 at 15:26

score 2 · Accepted Answer · answered Dec 28 '22 at 15:25

2

You need to do further filtering after merging.

library(dplyr)

inner_join(df1, df2, by = c("State", "Year")) %>%
  group_by(State) %>%
  filter(if_all(X1:X2, ~ sum(!is.na(.x)) >= 2)) %>%
  ungroup()

# # A tibble: 4 × 4
#   State  Year X1    X2   
#   <chr> <int> <chr> <chr>
# 1 B         1 NA    NA   
# 2 B         2 NA    2    
# 3 B         3 4     5    
# 4 B         4 3     7

answered Dec 28 '22 at 15:25

Darren Tsai

32,117
5
21
51

Your coding of sortation did the trick. If anyone else has similar struggles instead of using the inner_join code, use the full_join as to merge all observations and then sort afterwards. – Jens Dec 28 '22 at 15:52

Is there a command to merge data sets based on observations of a variable?

1 Answers1