
I have a reference data frame (df1) with three columns of "characteristics" (gender, year, code), and two columns of "values" (amount, status). It looks like this, but with many rows:

gender    year    code    amount   status
     M    2011       A        15      EMX
     M    2011       A       123      NOX
     F    2015       B         0      MIX
     F    2018       A        12      NOX
     F    2015       B        11      NOX

I have another data frame (df2) that just has the three "characteristics" columns. For example:

gender    year   code
     M    2011      A
     M    2011      A
     F    2018      A
     F    2015      B

For each row in df2, I want to assign "values" based on matches in "characteristics" to df1. Where there are multiple matches, I want to select pairs of "values" at random. So when there are duplicate "characteristics" in df2, they might end up with different pairs of "values", but all of them will have an exact match in df1. Essentially, for each combination of characteristics, I want the distribution of values to match between the two tables.

For example, the last row in 'df2' (gender = F, year = 2015, code = B) matches two rows in 'df1': the third row (amount = 0, status = MIX) and the fifth row (amount = 11, status = NOX). One of these matching rows should be selected at random, and the same applies to every case of multiple matches between 'df2' and 'df1' on gender, year and code.


So far, my approach has been to start by using dplyr to do a left_join between the two data frames. However, this provides all possible "values" for each row in df2, rather than selecting one at random. So I then have to group by characteristics and select one. This produces a very large intermediate table and doesn't seem very efficient.
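Roughly, my current approach looks like this (just a sketch; `row_id` is a helper column I add so that each row of df2 gets its own independent draw):

```r
library(dplyr)

df2 %>%
  mutate(row_id = row_number()) %>%                     # tag each df2 row
  left_join(df1, by = c("gender", "year", "code")) %>%  # all possible matches
  group_by(row_id) %>%                                  # one group per df2 row
  slice_sample(n = 1) %>%                               # keep one match at random
  ungroup() %>%
  select(-row_id)
```

The intermediate table produced by the left_join can be far larger than df2, which is what I'd like to avoid.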

I wondered if anyone had suggestions for a more efficient method? I've previously found that joining with the data.table package is quicker, but don't really have a good understanding of the package. I also wonder if I should be doing joins at all, or should just use the sample function?

Any help much appreciated.

Henrik
rw2
  • Are the 'characteristics' columns of both dataframes exactly the same? Or is there a reason to work with a df2? – Sven Jun 12 '19 at 10:49
  • df2 is a separate cohort of people (each row is a person) - I want to generate a distribution of 'values' for them, based on the distribution of characteristics/values in df1. All combinations of characteristics in df2 will appear somewhere in df1. – rw2 Jun 12 '19 at 11:32
  • Can two rows in df2 match one row in df1? – s_baldur Jun 12 '19 at 11:50
  • Yes, although it's quite unlikely, that could potentially happen. – rw2 Jun 12 '19 at 11:52

4 Answers

6

Use 'd2' to look up rows in 'd1' based on matches in 'gender', 'year', 'code' (d1[d2, on = .(gender, year, code), ...]). For each match (by = .EACHI), sample one row index (sample(.N, 1L)) and use it to index 'amount' and 'status'.

d1[d2, on = .(gender, year, code),
   {ri <- sample(.N, 1L)
    .(amount = amount[ri], status = status[ri])}, by = .EACHI]

# sample based on set.seed(1)
#    gender year code amount status
# 1:      M 2011    A     15    EMX
# 2:      M 2011    A     15    EMX
# 3:      F 2018    A     12    NOX
# 4:      F 2015    B     11    NOX

Note that there is an open data.table issue, Enhanced functionality of mult argument, about how to handle cases where multiple rows in x match a row in i. Currently the valid options are "all" (the default), "first" and "last". If/when the issue is implemented, mult = "random" may be used to select a random row among the matches, replacing the sample(.N, size = 1L) idiom above.

Henrik
  • this is perfect. – s_baldur Jun 12 '19 at 12:55
  • This is great, and really quick - many thanks! One additional thing I'm having trouble figuring out... If I also have a person_ID column in df2 that I would like to retain in the results, how could I modify the code to keep that in? – rw2 Jun 12 '19 at 13:37
  • Although I didn't write it explicitly, the `.( )` part corresponds to the `j` slot in `data.table` terms. Then have a look at the `j` argument in `?data.table`: "As long as `j` returns a list [and `.()` is a shorthand alias to `list()`], each element of the list becomes a column in the resulting `data.table`". So if you want to return additional columns from 'd2', just add them inside `.( )`, e.g. a column named 'id': `.(id = id, amount = amount[ri], status = status[ri])`. – Henrik Jun 12 '19 at 13:47
  • In this context, it may also be worthwhile to familiarize yourself with prefix `i.` and prefix `x.` - search for them in the help text. Cheers. – Henrik Jun 12 '19 at 14:25
  • @rw2 Just to let you know: I added a pointer to a relevant open `data.table` issue. – Henrik Jun 12 '19 at 15:25
  • You could jumble df1's rows and take the first match: `df2[, c("amount", "status") := df1[sample(.N)][.SD, on=.(gender, year, code), mult="first", .(x.amount, x.status)]]` I guess this should be more efficient than something with `by=` – Frank Jun 12 '19 at 16:48
  • Does anybody know why I get this Error trying to execute the code above? Error in `[.data.frame`(d1, d2, on = .(gender, year, code), { : unused arguments (on = .(gender, year, code), by = .EACHI) – C. Toni Jul 27 '20 at 11:28
  • @C.Toni 'd1' needs to be a `data.table`. Begin with `setDT(d1)`. – Henrik Jul 27 '20 at 11:34
1

My data.table game is pretty weak, but here's a potential solution using an approach similar to the one you describe above. First, I define the data frames.

# Define data frames
df1 <- read.table(text= "gender    year    code    amount   status
M    2011       A        15      EMX
M    2011       A       123      NOX
F    2015       B         0      MIX
F    2018       A        12      NOX
F    2015       B        11      NOX", header = TRUE)

df2 <- read.table(text = "gender    year   code
     M    2011      A
     M    2011      A
     F    2018      A
     F    2015      B", header = TRUE)

Then, I set the random number generator seed for reproducibility and load the library.

# Set RNG seed
set.seed(4)

# Load library
library(data.table)

Next, I convert data frames to data tables.

# Convert to data tables
dt1 <- data.table(df1) 
dt2 <- data.table(df2) 

Here, I do the actual joins, etc. I've done it 5 times in a loop to show the randomness of the results.

for(i in c(1:5)){
  # Add row numbers
  dt3 <- dt2[, rn := .I
             ][dt1, on = .(gender, year, code)
               ][, .SD[sample(.N)[1]], .(gender, year, code, rn)
                 ][, rn := NULL]

  # Check results
  print(dt3)
}
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A     15    EMX
#> 3:      F 2015    B      0    MIX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A    123    NOX
#> 3:      F 2015    B     11    NOX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A    123    NOX
#> 3:      F 2015    B     11    NOX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A     15    EMX
#> 2:      M 2011    A     15    EMX
#> 3:      F 2015    B     11    NOX
#> 4:      F 2018    A     12    NOX
#>    gender year code amount status
#> 1:      M 2011    A    123    NOX
#> 2:      M 2011    A     15    EMX
#> 3:      F 2015    B      0    MIX
#> 4:      F 2018    A     12    NOX

Created on 2019-06-12 by the reprex package (v0.3.0)

What I actually do is add row numbers to the data table, which will help me pare down the final data table. I join the data tables and then group all of the rows that originated from a single row in dt2 and pull one at random using sample. (This bit of code is borrowed from @akrun here.) Finally, I drop the row number column.

Dan
  • Many thanks, this does work exactly as I wanted. I accepted a later answer as it seems to work quicker - I guess this is because with my actual data, the intermediate table after the join ends up being enormous. – rw2 Jun 12 '19 at 12:27
  • Nice one! Small efficiency comment: `sample(.N, 1)` should be much faster than `sample(.N)[1]` at least if `.N` is big. – s_baldur Jun 12 '19 at 12:58
1
library(dplyr)
library(purrr)

df2 %>%
  mutate(
    sampled = pmap(
      .l = df2,
      .f = ~ df1 %>%
        # sample one matching row so 'amount' and 'status' stay paired
        filter(gender == ..1, year == ..2, code == ..3) %>%
        sample_n(1) %>%
        select(amount, status)
    )
  ) %>%
  tidyr::unnest(sampled)

This is pretty slow and I'd personally avoid it, but it's a way of doing it.

Pablo Rod
1

I expect this would be efficient:

# (df1 and df2 need to be data.tables, e.g. via setDT())
df1[, row := .I]   # remember each row's original position
keys <- c("year", "gender", "code")
setkeyv(df1, keys)
setkeyv(df2, keys)

for (rowdf2 in seq_len(nrow(df2))) {
  set(df2, i = rowdf2, j = "rowindf1", value = df1[df2[rowdf2], x.row[sample(.N, 1)]])
}

setkeyv(df1, "row")
df1[df2[, .(rowindf1)]]

Example output:

#    gender year code amount status row
# 1:      M 2011    A    123    NOX   2
# 2:      M 2011    A     15    EMX   1
# 3:      F 2015    B     11    NOX   5
# 4:      F 2018    A     12    NOX   4
s_baldur