Random merging between 2 dataframes R

Question

I have a first dataframe looking like that :

Department	Municipality	Location	Lat.	Long.
ANTIOQUIA	MEDELLIN	PALMITAS	6.343341	-75.69004
ANTIOQUIA	MEDELLIN	SANTA ELENA	6.209718	-75.50191
ANTIOQUIA	MEDELLIN	ALTAVISTA	6.223150	-75.62856

And a second dataframe :

Department_Name	Municipality_Name
ANTIOQUIA	MEDELLIN
ANTIOQUIA	MEDELLIN

I'd like to merge the two data frames randomly, like that :

Department_Name	Municipality_Name	Location	Lat	Long.
ANTIOQUIA	MEDELLIN	SANTA ELENA	6.209718	-75.50191
ANTIOQUIA	MEDELLIN	PALMITAS	6.343341	-75.69004

Following this topic : Join data frames and select random row when there are multiple matches here's what i tried

library(dplyr)

df2<-subset(df2, select=c(Department_Name, Municipality_Name, Location,Long., Lat.))

df2 <- df2 %>% rename(Department = Department_Name, Municipality=Municipality_Name)

df1[df2, on = .(Department, Municipality, Location,Long., Lat.),
   {ri <- sample(.N, 1L)
   .(Department = Department[ri], Municipality = Municipality[ri])}, by = .EACHI]

Error in sample(.N, 1L) : object '.N' not found

My background in programming isn't good enough to understand the codes provided in this topic, so if someone can help with this error it'd great !

score 1 · Answer 1 · answered Jul 23 '21 at 03:43

1

A way using dplyr -

library(dplyr)

df2 %>%
  count(Department_Name, Municipality_Name) %>%
  left_join(df1, by = c('Department_Name' = 'Department', 
                        'Municipality_Name' = 'Municipality')) %>%
  group_by(Department_Name, Municipality_Name) %>%
  sample_n(first(n)) %>%
  ungroup

answered Jul 23 '21 at 03:43

Ronak Shah

377,200
20
156
213

It worked with my first dataset, but when i try on another one (124k observations), i have the following error message : "Erreur : `size` must be less or equal than 1 (size of data), set `replace` = TRUE to use sampling with replacement.", it comes after sample_n(first(n)), do you have an idea of where it can comes from ? – katdataecon Jul 23 '21 at 15:41
1

I think that is because the number of rows in `df2` are more than the corresponding rows present in `df1`. You can add `replace = TRUE` in `sample_n`. – Ronak Shah Jul 24 '21 at 00:59

score 0 · Answer 2 · answered Jul 22 '21 at 23:46

The second data frame is a proper subset of the first. Appending the second data frame provides no additional information to the first. Therefore, you can take a sample of the observations to achieve your goal:

iris[sample(x = nrow(iris), size = 5, replace = FALSE), ]

score 0 · Answer 3 · edited Jul 23 '21 at 07:22

0

library(data.table)
df1_bis = data.table(df1)
df2_bis=data.table(df2) try <- data.frame(df1_bis[df2_bis, on = .(Department, Municipality),
           {ri <- sample(.N, 1L)
           .(Long. = Long.[ri], Lat. = Lat[ri])}, by = .EACHI])`

edited Jul 23 '21 at 07:22

Alexander van Oostenrijk

4,644
3
23
37

answered Jul 23 '21 at 00:03

katdataecon

185
8

score 0 · Answer 4 · answered Jul 23 '21 at 00:08

If you are using dplyr, then relying on sample_n to get a random sample of the data frames, and left_join to do the merging might get code that is easier to interpret.

Here I provide slightly different data frame examples:

library(dplyr)

df_veredas <- #This is a sample dataframe with info for veredas
  data.frame(departamento = c("ANTIOQUIA", "ANTIOQUIA", "ANTIOQUIA", "CUNDINAMARCA", "CUNDINAMARCA", "CUNDINAMARCA"), 
             municipio = c("MEDELLIN", "MEDELLIN", "MEDELLIN", "GUADUAS", "GUADUAS", "GUADUAS"), 
             vereda = c("PALMITAS", "SANTA ELENA", "ALTAVISTA", "CEDRALES", "EL DIAMANTE", "CARRAPAL"),
             lat = c(6.343341, 6.209718, 6.22315, 5.05369106131653, 5.03379856537084, 5.26723603834365), 
             long = c(-75.69004, -75.50191, -74.6004510276649, -74.6904256, -74.5475269556119, -74.5892936214298))

df_municipios <- # This is a sample data frame with info for municipalities
  data.frame(Department_Name = c("ANTIOQUIA", "CUNDINAMARCA", "ATLÁNTICO"), 
             Municipality_Name = c("MEDELLIN", "GUADUAS", "BARRANQUILLA"),
             DIVIPOLA = c("05001", "25320", "08001" ))

# Below is where the sampling and merging happen.
sample_n(#This is where the sampling occurs. I get 2 random observations from df_veredas
tbl = df_veredas,  
         size = 2, 
         replace = FALSE) %>% 
  left_join(# This is where the merge happens
    df_municipios, #Merges the sampled df_veredas with df_municipios
    by = c("departamento" = "Department_Name", #Indexing by department, which is written differently in each table
           "municipio"    = "Municipality_Name" #And also indexing by municiplaity, which is also written differently in each table
))

  departamento municipio   vereda      lat      long DIVIPOLA
1 CUNDINAMARCA   GUADUAS CEDRALES 5.053691 -74.69043    25320
2 CUNDINAMARCA   GUADUAS CARRAPAL 5.267236 -74.58929    25320

Random merging between 2 dataframes R

4 Answers4