1

I have a first dataframe looking like that :

Department Municipality Location Lat. Long.
ANTIOQUIA MEDELLIN PALMITAS 6.343341 -75.69004
ANTIOQUIA MEDELLIN SANTA ELENA 6.209718 -75.50191
ANTIOQUIA MEDELLIN ALTAVISTA 6.223150 -75.62856

And a second dataframe :

Department_Name Municipality_Name
ANTIOQUIA MEDELLIN
ANTIOQUIA MEDELLIN

I'd like to merge the two data frames randomly, like that :

Department_Name Municipality_Name Location Lat Long.
ANTIOQUIA MEDELLIN SANTA ELENA 6.209718 -75.50191
ANTIOQUIA MEDELLIN PALMITAS 6.343341 -75.69004

Following this topic : Join data frames and select random row when there are multiple matches here's what i tried

library(dplyr)

df2<-subset(df2, select=c(Department_Name, Municipality_Name, Location,Long., Lat.))

df2 <- df2 %>% rename(Department = Department_Name, Municipality=Municipality_Name)

df1[df2, on = .(Department, Municipality, Location,Long., Lat.),
   {ri <- sample(.N, 1L)
   .(Department = Department[ri], Municipality = Municipality[ri])}, by = .EACHI]

Error in sample(.N, 1L) : object '.N' not found

My background in programming isn't good enough to understand the codes provided in this topic, so if someone can help with this error it'd great !

katdataecon
  • 185
  • 8

4 Answers4

1

A way using dplyr -

library(dplyr)

df2 %>%
  count(Department_Name, Municipality_Name) %>%
  left_join(df1, by = c('Department_Name' = 'Department', 
                        'Municipality_Name' = 'Municipality')) %>%
  group_by(Department_Name, Municipality_Name) %>%
  sample_n(first(n)) %>%
  ungroup
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • It worked with my first dataset, but when i try on another one (124k observations), i have the following error message : "Erreur : `size` must be less or equal than 1 (size of data), set `replace` = TRUE to use sampling with replacement.", it comes after sample_n(first(n)), do you have an idea of where it can comes from ? – katdataecon Jul 23 '21 at 15:41
  • 1
    I think that is because the number of rows in `df2` are more than the corresponding rows present in `df1`. You can add `replace = TRUE` in `sample_n`. – Ronak Shah Jul 24 '21 at 00:59
0

The second data frame is a proper subset of the first. Appending the second data frame provides no additional information to the first. Therefore, you can take a sample of the observations to achieve your goal:

iris[sample(x = nrow(iris), size = 5, replace = FALSE), ]
Connor Krenzer
  • 457
  • 2
  • 11
0
library(data.table)
df1_bis = data.table(df1)
df2_bis=data.table(df2) try <- data.frame(df1_bis[df2_bis, on = .(Department, Municipality),
           {ri <- sample(.N, 1L)
           .(Long. = Long.[ri], Lat. = Lat[ri])}, by = .EACHI])`
Alexander van Oostenrijk
  • 4,644
  • 3
  • 23
  • 37
katdataecon
  • 185
  • 8
0

If you are using dplyr, then relying on sample_n to get a random sample of the data frames, and left_join to do the merging might get code that is easier to interpret.

Here I provide slightly different data frame examples:

library(dplyr)

df_veredas <- #This is a sample dataframe with info for veredas
  data.frame(departamento = c("ANTIOQUIA", "ANTIOQUIA", "ANTIOQUIA", "CUNDINAMARCA", "CUNDINAMARCA", "CUNDINAMARCA"), 
             municipio = c("MEDELLIN", "MEDELLIN", "MEDELLIN", "GUADUAS", "GUADUAS", "GUADUAS"), 
             vereda = c("PALMITAS", "SANTA ELENA", "ALTAVISTA", "CEDRALES", "EL DIAMANTE", "CARRAPAL"),
             lat = c(6.343341, 6.209718, 6.22315, 5.05369106131653, 5.03379856537084, 5.26723603834365), 
             long = c(-75.69004, -75.50191, -74.6004510276649, -74.6904256, -74.5475269556119, -74.5892936214298))

df_municipios <- # This is a sample data frame with info for municipalities
  data.frame(Department_Name = c("ANTIOQUIA", "CUNDINAMARCA", "ATLÁNTICO"), 
             Municipality_Name = c("MEDELLIN", "GUADUAS", "BARRANQUILLA"),
             DIVIPOLA = c("05001", "25320", "08001" ))

# Below is where the sampling and merging happen.
sample_n(#This is where the sampling occurs. I get 2 random observations from df_veredas
tbl = df_veredas,  
         size = 2, 
         replace = FALSE) %>% 
  left_join(# This is where the merge happens
    df_municipios, #Merges the sampled df_veredas with df_municipios
    by = c("departamento" = "Department_Name", #Indexing by department, which is written differently in each table
           "municipio"    = "Municipality_Name" #And also indexing by municiplaity, which is also written differently in each table
))

  departamento municipio   vereda      lat      long DIVIPOLA
1 CUNDINAMARCA   GUADUAS CEDRALES 5.053691 -74.69043    25320
2 CUNDINAMARCA   GUADUAS CARRAPAL 5.267236 -74.58929    25320
Nicolás Velasquez
  • 5,623
  • 11
  • 22