
I want to merge two data frames. The first one has fewer rows (2,354), one unique ID (CLAVE), and two repeated IDs (CVE_MUN and CVE_ENT), which are geographical identifiers.

# A tibble: 2,354 x 6
   CLAVE   NOMBRE       CVE_ENT `ENTIDAD FEDERATIVA~ CVE_MUN `MUNICIPIO (INEG~
   <chr>   <chr>          <dbl> <chr>                  <dbl> <chr>            
 1 03AGN-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES   
 2 03AGO-~ Agostaderito       1 AGUASCALIENTES             1 AGUASCALIENTES   
 3 03AGP-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES   
 4 03AGS-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES   
 5 03AIS-~ Alfa Nissan        1 AGUASCALIENTES             1 AGUASCALIENTES   
 6 03ALF-~ Nissan Mexi~       1 AGUASCALIENTES             1 AGUASCALIENTES   
 7 03APT-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES   
 8 03ASS-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES   
 9 03ASU-~ Asuncion           1 AGUASCALIENTES             1 AGUASCALIENTES   
10 03AYU-~ Ayuntamiento       1 AGUASCALIENTES             1 AGUASCALIENTES   

So I have 26 different IDs (CLAVE) in a certain area (CVE_ENT: 1, CVE_MUN: 1), 3 IDs in another area (CVE_ENT: 1, CVE_MUN: 2), and so on:

CVE_ENT    CVE_MUN    Freq
 1            1        26
 1            2        3
 1            3        2
 1            5        3
 1            6        1
 1            7        1
 1            9        1
 1           10        2
 1           11        2

I want to map all the IDs, so I am trying to merge df1 with a very detailed geographical data frame. This second data frame has 299,615 rows, and each MAPA value has a unique long/lat coordinate pair.

      MAPA CVE_ENT        NOM_ENT NOM_ABR CVE_MUN        NOM_MUN CVE_LOC
1  10010001       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES       1
2  10010094       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES      94
3  10010096       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES      96
4  10010100       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     100
5  10010102       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     102
6  10010104       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     104
7  10010106       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     106
8  10010112       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     112
9  10010113       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     113
10 10010120       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     120
11 10010121       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     121
12 10010125       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     125
13 10010126       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     126
14 10010127       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     127
15 10010128       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     128
16 10010135       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     135
17 10010138       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     138
18 10010139       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     139
19 10010141       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     141
20 10010144       1 AGUASCALIENTES    AGS.       1 AGUASCALIENTES     144

Again, I have CVE_ENT and CVE_MUN, but this time they are repeated many more times than in my first data frame:

CVE_ENT    CVE_MUN    Freq
 1           1         725
 1           2         242
 1           3         293
 1           4         94

What I want is to merge df1 with df2 on CVE_ENT and CVE_MUN without repeating MAPA. That is, each CLAVE should be merged to a different MAPA within its CVE_ENT/CVE_MUN group. Ideally, the first 26 rows of df1 with CVE_ENT = 1 and CVE_MUN = 1 would be matched to the first 26 rows of df2 with CVE_ENT = 1 and CVE_MUN = 1.

So far I have tried left_join and then filtering by CLAVE, but left_join matches every df1 row to every df2 row sharing CVE_ENT and CVE_MUN, so MAPA gets repeated; and after deduplicating on CLAVE, every CLAVE in a group keeps the same first MAPA. So again, I don't have a unique geographical long/lat per CLAVE.

Coord <- left_join(df1, df2, by = c("CVE_ENT", "CVE_MUN"))  # every pairwise match within each group
DT <- Coord[!duplicated(Coord$CLAVE),]                      # keeps only the first match per CLAVE
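
To quantify the blow-up: each (CVE_ENT, CVE_MUN) group contributes Freq_df1 * Freq_df2 rows to the join. A quick check, assuming the two frequency tables shown above are stored as freq1 and freq2 (hypothetical names):

library(dplyr)
blowup <- inner_join(freq1, freq2, by = c("CVE_ENT", "CVE_MUN"),
                     suffix = c("_df1", "_df2"))  # align the two Freq columns
sum(blowup$Freq_df1 * blowup$Freq_df2)            # total matched rows left_join produces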

My ideal final data frame looks like this:

   CLAVE   NOMBRE       CVE_ENT `ENTIDAD FEDERATIVA~ CVE_MUN `MUNICIPIO (INEG~ MAPA
               
 1 03AGN-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES 10010001
 2 03AGO-~ Agostaderito       1 AGUASCALIENTES             1 AGUASCALIENTES 10010094  
 3 03AGP-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES 10010096  
 4 03AGS-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES 10010100  
 5 03AIS-~ Alfa Nissan        1 AGUASCALIENTES             1 AGUASCALIENTES 10010102  
 6 03ALF-~ Nissan Mexi~       1 AGUASCALIENTES             1 AGUASCALIENTES 10010104  
 7 03APT-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES 10010106  
 8 03ASS-~ Aguascalien~       1 AGUASCALIENTES             1 AGUASCALIENTES 10010112  
 9 03ASU-~ Asuncion           1 AGUASCALIENTES             1 AGUASCALIENTES 10010113  
10 03AYU-~ Ayuntamiento       1 AGUASCALIENTES             1 AGUASCALIENTES 10010120  
Lilia
  • It would help if you provided data that can be loaded directly in R, such as a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). – Alexlok Aug 27 '20 at 00:30

1 Answer


You can do that in two steps:

  1. find matching subsets of df1 and df2 (what a join typically does)
  2. bind them with cbind() or bind_cols(), since you actually don't want to do a join

To keep the matching subsets together, it's practical to use nesting.

library(tidyverse)
# Dummy data
df1 <- tibble(uniq_id = letters[1:10],
              id1 = rep(1:3, times = c(5,3,2)),
              id2 = rep(1:5, each = 2))

df2 <- tibble(id1 = rep(1:3, each = 15),
              id2 = rep(1:5, times = 9),
              coord = paste("coord",id1, id2, c(LETTERS, LETTERS[1:19]), sep="-"))

df1 %>%
  nest_join(df2, by = c("id1", "id2")) %>%   # list-column df2: the matching rows of df2 for each df1 row
  nest(df1 = uniq_id) %>%                    # list-column df1: the uniq_id's of each (id1, id2) group
  rowwise() %>%
  mutate(cbound = list(bind_cols(df1, df2[1:nrow(df1), ]))) %>%  # pair the k-th id with the k-th df2 row
  select(-df1, -df2) %>%
  unnest(cbound)

But I do have doubts about why you would do this: normally a join is well defined because you have a specific criterion for matching rows of df1 to rows of df2. Here, why should AGN get MAPA = 10010001 and AGO get MAPA = 10010094, and not the other way around?
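
If you do want this arbitrary first-to-first pairing, the same result can also be sketched without nesting: number the rows within each (id1, id2) group and join on that index. This is a minimal sketch using the same dummy df1/df2; the helper column row_in_group is my own naming.

library(dplyr)

df1_idx <- df1 %>%
  group_by(id1, id2) %>%
  mutate(row_in_group = row_number()) %>%  # 1, 2, ... within each (id1, id2) group
  ungroup()

df2_idx <- df2 %>%
  group_by(id1, id2) %>%
  mutate(row_in_group = row_number()) %>%
  ungroup()

left_join(df1_idx, df2_idx, by = c("id1", "id2", "row_in_group")) %>%
  select(-row_in_group)

Here the k-th row of a group in df1 is matched to the k-th row of the same group in df2; if a group has fewer rows in df2 than in df1, the extra df1 rows get NA coordinates.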

Base R solution

You can obtain the same result in base R:

# Dummy data
df1 <- data.frame(uniq_id = letters[1:10],
              id1 = rep(1:3, times = c(5,3,2)),
              id2 = rep(1:5, each = 2))

df2 <- data.frame(id1 = rep(1:3, each = 15),
              id2 = rep(1:5, times = 9))
df2$coord <- paste("coord",df2$id1, df2$id2, c(LETTERS, LETTERS[1:19]), sep="-")

parallel_merge <- function(xy){
  x <- xy[1]
  y <- xy[2]
  df1_rows <- which(df1$id1 == x & df1$id2 == y)   # rows of df1 in this combination

  if(length(df1_rows) == 0) return(NULL)           # combination absent from df1: nothing to merge

  df2_rows <- which(df2$id1 == x & df2$id2 == y)   # rows of df2 in this combination
  df2_rows <- df2_rows[1:length(df1_rows)]         # truncate df2 to the size of the df1 subset

  cbind(df1[df1_rows,],
        df2[df2_rows,])
}

combinations <- expand.grid(unique(df1$id1), unique(df2$id2))

coords_by_comb <- apply(combinations, 1, parallel_merge)
do.call(rbind, coords_by_comb)

The logic is slightly different: we first use expand.grid() to generate every combination of id1 and id2, then use apply() to loop over these combinations (note that if some combinations do not exist in df1, this is not the most efficient approach). For each combination, the function parallel_merge() finds the subset of df1 to keep, finds the matching subset of df2, and truncates the latter to the size of the former. Then we just need to assemble everything with rbind().
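
To adapt this to the question's data, the same function can be sketched with the real column names; this assumes df1 and df2 are the two data frames shown above and that every (CVE_ENT, CVE_MUN) group of df1 has at least as many rows in df2:

parallel_merge <- function(xy){
  x <- xy[1]
  y <- xy[2]
  df1_rows <- which(df1$CVE_ENT == x & df1$CVE_MUN == y)

  if(length(df1_rows) == 0) return(NULL)

  df2_rows <- which(df2$CVE_ENT == x & df2$CVE_MUN == y)
  df2_rows <- df2_rows[seq_along(df1_rows)]  # truncate df2 to the size of the df1 subset

  cbind(df1[df1_rows, ], df2[df2_rows, ])    # appends MAPA (and the rest of df2) to each CLAVE
}

combinations <- expand.grid(unique(df1$CVE_ENT), unique(df1$CVE_MUN))
coords_by_comb <- apply(combinations, 1, parallel_merge)
Coord <- do.call(rbind, coords_by_comb)

The result should have one MAPA per CLAVE, as in the ideal output above.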

Alexlok
  • I am trying to run it with the dummy df's and with my own and I keep getting this error: `Error in 1:nrow(df1) : argument of length 0` – Lilia Aug 27 '20 at 21:52
  • Try running parts of the pipe to find which step creates the error: I imagine this happens at the `mutate` step? Does it happen with the dummy data I included in my answer? Otherwise, can you edit your question to provide a minimal dataset that reproduces the error? Does it still happen in a newly started R session? (I edited my answer to include the library call.) – Alexlok Aug 27 '20 at 22:03
  • yes, it is in the `mutate` step and it happens with the dummy data you included in your answer. I did restart my R session, called tidyverse and I am getting the same result. I am wondering if it is because it is a tibble and not a data frame – Lilia Aug 27 '20 at 22:12
  • Can you run `sessionInfo()` and report the version of `R`, `dplyr` and `tidyr`? – Alexlok Aug 27 '20 at 22:14
  • `R version 4.0.2 (2020-06-22)` `dplyr_1.0.1` `tidyr_1.1.1` – Lilia Aug 27 '20 at 22:17
  • Strange, it works on my computer with the same versions (except dplyr 1.0.2). I edited my answer to include a base R solution, see if that works. – Alexlok Aug 27 '20 at 22:42
  • Thank you, it worked perfectly. I am doing this arbitrary join since I know these unique IDs are in the same region but at different geographical points. So although it is arbitrary, I know for sure they are within the same region. – Lilia Aug 27 '20 at 22:58