Merge data based on several columns R

Question

I'd like to merge two dataset based on Permno columns. But one of the datasets has 5 Permno columns instead of 1. The following table is a simplified example of what the 2nd dataset looks like:

execid	exec-fname	Permno	Permno1	Permno2
124	William	17743
125	Warren		83443	17778
897	Chris	18092
393	James		75294	81774

This is an reproducible snippet of my data:

structure(list(permno = structure(c(17743, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 18092
), label = "Historical CRSP PERMNO Link to COMPUSTAT Record", format.stata = "%12.0g"), 
    permno1 = structure(c(NA, 83443, 83443, 83443, 83443, 83443, 
    83443, 83443, 83443, 83443, 17778, 17778, 17778, 17778, 17778, 
    17778, 17778, 17778, 17778, 17778, NA), label = "1 permno", format.stata = "%12.0g"), 
    permno2 = structure(c(NA, 17778, 17778, 17778, 17778, 17778, 
    17778, 17778, 17778, 17778, 83443, 83443, 83443, 83443, 83443, 
    83443, 83443, 83443, 83443, 83443, NA), label = "2 permno", format.stata = "%12.0g"), 
    permno3 = structure(c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), label = "3 permno", format.stata = "%12.0g"), 
    permno4 = structure(c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), label = "4 permno", format.stata = "%12.0g"), 
    execid = structure(c(124, 125, 125, 125, 125, 125, 125, 125, 
    125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 
    133), label = "(firstnm) execid", format.stata = "%10.0g"), 
    exec_lname = structure(c("Berkley", "Buffett", "Buffett", 
    "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", 
    "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", 
    "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", "Miller"
    ), label = "(firstnm) exec_lname", format.stata = "%20s"), 
    exec_fname = structure(c("William", "Warren", "Warren", "Warren", 
    "Warren", "Warren", "Warren", "Warren", "Warren", "Warren", 
    "Warren", "Warren", "Warren", "Warren", "Warren", "Warren", 
    "Warren", "Warren", "Warren", "Warren", "Dane"), label = "(firstnm) exec_fname", format.stata = "%26s")), row.names = c(NA, 
-21L), class = c("tbl_df", "tbl", "data.frame"))

The 1st data set only has 1 Permno. How can I merge the two datasets based on the Permno and check every column of the 2nd data set for a match?

Is there at most 1 company ID value per row? If so, use something like `dplyr::coalesce` to collapse the values into a single column and then do a standard `dplyr:inner_join`. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Share data via a `dput()` so we can copy/paste rather than re-import. — MrFlick, Aug 01 '23 at 19:33
From this example, it seems as if the four company IDs can be reduced down to a single column. Is there a reason that those four columns are sparse like this? Are there any rows that contains more than one non-empty company id? — r2evans, Aug 01 '23 at 19:44
The company IDs are spars because they have changed/ got updated over the years. Sometimes a company could have more than one ID. — bear_525, Aug 07 '23 at 06:11
@MrFlick I just updated the question with a reproducible example. Thank you. — bear_525, Aug 07 '23 at 06:32

Mark · Answer 1 · 2023-08-08T05:28:52.827

To add some meat onto the Mr Flick's comment:

library(dplyr)

df1 <- data.frame(
    Company.ID = 12456,
    value = "goal"
)

df2 <- data.frame(
  IDs = c(124, 345, 675, 897, 235),
  People = c("Jim", "Mary", "Hilary", "Chris", "John."),
  Company.ID.1 = c(12456, 67843, NA, NA, NA),
  Company.ID.2 = c(NA, NA, 569024, NA, NA),
  Company.ID.3 = c(NA, NA, NA, 789461, NA),
  Company.ID.4 = c(NA, NA, NA, NA, 285013)
)
df2 |>
    mutate(IDs, People, Company.ID = coalesce(Company.ID.1, Company.ID.2, Company.ID.3, Company.ID.4), .keep = "none") |>
    inner_join(df1, by = "Company.ID")

# Output:
  IDs People Company.ID value
1 124    Jim      12456  goal

Update:

df1 <- structure(list(permno = structure(c(17743, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 18092
), label = "Historical CRSP PERMNO Link to COMPUSTAT Record", format.stata = "%12.0g"), 
    permno1 = structure(c(NA, 83443, 83443, 83443, 83443, 83443, 
    83443, 83443, 83443, 83443, 17778, 17778, 17778, 17778, 17778, 
    17778, 17778, 17778, 17778, 17778, NA), label = "1 permno", format.stata = "%12.0g"), 
    permno2 = structure(c(NA, 17778, 17778, 17778, 17778, 17778, 
    17778, 17778, 17778, 17778, 83443, 83443, 83443, 83443, 83443, 
    83443, 83443, 83443, 83443, 83443, NA), label = "2 permno", format.stata = "%12.0g"), 
    permno3 = structure(c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), label = "3 permno", format.stata = "%12.0g"), 
    permno4 = structure(c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), label = "4 permno", format.stata = "%12.0g"), 
    execid = structure(c(124, 125, 125, 125, 125, 125, 125, 125, 
    125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 125, 
    133), label = "(firstnm) execid", format.stata = "%10.0g"), 
    exec_lname = structure(c("Berkley", "Buffett", "Buffett", 
    "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", 
    "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", 
    "Buffett", "Buffett", "Buffett", "Buffett", "Buffett", "Miller"
    ), label = "(firstnm) exec_lname", format.stata = "%20s"), 
    exec_fname = structure(c("William", "Warren", "Warren", "Warren", 
    "Warren", "Warren", "Warren", "Warren", "Warren", "Warren", 
    "Warren", "Warren", "Warren", "Warren", "Warren", "Warren", 
    "Warren", "Warren", "Warren", "Warren", "Dane"), label = "(firstnm) exec_fname", format.stata = "%26s")), row.names = c(NA, 
-21L), class = c("tbl_df", "tbl", "data.frame"))

df2 <- tibble(
  IDs = c(124, 125, 897, 393),
  People = c("William", "Warren", "Chris", "James"),
  `Co ID` = c(17743, NA, 18092, NA),
  `Co ID 1` = c(NA, 83443, NA, 75294),
  `Co ID 2` = c(NA, 17778, NA, 81774),
  `Co ID 3` = c(NA, NA, NA, NA),
  `Co ID 4` = c(NA, NA, NA, NA)
)
df1 <- df1 |> 
  pivot_longer(
    cols = starts_with("permno"),
    names_to = "permno",
    values_to = "company_id"
  , values_drop_na = TRUE) |>
  select(-permno)
df2 <- df2 |>
 pivot_longer(
    cols = starts_with("Co ID"),
    names_to = "id_number",
    values_to = "company_id", 
    values_drop_na = TRUE) |>
    select(-id_number)

left_join(df2, df1, by = "company_id")

It's unsure if you want a left join, a right join, an inner join, or a full join. Play around with it, and see which works best

Thank you. I edited the question for clarity so perhaps it will help to clarify the need to merge per column versus using the `coalesce()` function. — bear_525, Aug 08 '23 at 05:12
In my real data they are called permno. Sorry I realized the mistake now rephrased the question. — bear_525, Aug 08 '23 at 16:11

Merge data based on several columns R

1 Answers1