I have a reference data frame (df1) with three columns of "characteristics" (gender, year, code), and two columns of "values" (amount, status). It looks like this, but with many rows:
gender  year  code  amount  status
M       2011  A         15  EMX
M       2011  A        123  NOX
F       2015  B          0  MIX
F       2018  A         12  NOX
F       2015  B         11  NOX
I have another data frame (df2) that just has the three "characteristics" columns. For example:
gender  year  code
M       2011  A
M       2011  A
F       2018  A
F       2015  B
For each row in df2, I want to assign a pair of "values" taken from a row of df1 with matching "characteristics". Where there are multiple matches, the pair should be selected at random, so duplicate "characteristics" rows in df2 might end up with different pairs of "values", but every assigned pair will have an exact match in df1. Essentially, for each combination of characteristics, I want the distribution of values to match between the two tables.
For example, the last row in df2 (gender = F, year = 2015, code = B) matches two rows in df1: the third row (amount = 0, status = MIX) and the fifth row (amount = 11, status = NOX), and one of these should be selected at random. The same applies to every case of multiple matches between df2 and df1 on gender, year and code.
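For reproducibility, the example data frames can be rebuilt with:

    df1 <- data.frame(
      gender = c("M", "M", "F", "F", "F"),
      year   = c(2011, 2011, 2015, 2018, 2015),
      code   = c("A", "A", "B", "A", "B"),
      amount = c(15, 123, 0, 12, 11),
      status = c("EMX", "NOX", "MIX", "NOX", "NOX")
    )

    df2 <- data.frame(
      gender = c("M", "M", "F", "F"),
      year   = c(2011, 2011, 2018, 2015),
      code   = c("A", "A", "A", "B")
    )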
So far, my approach has been to use dplyr to do a left_join between the two data frames. However, this returns all possible "values" for each row in df2 rather than one selected at random, so I then have to group and pick a single match per row. This produces a very large intermediate table and doesn't seem very efficient.
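Concretely, the current approach looks roughly like this (a sketch rather than my exact code; I add a row id so that duplicated rows in df2 each draw their own match):

    library(dplyr)

    result <- df2 %>%
      mutate(row_id = row_number()) %>%                      # tag each df2 row individually
      left_join(df1, by = c("gender", "year", "code")) %>%   # expands to every matching value pair
      group_by(row_id) %>%
      slice_sample(n = 1) %>%                                # keep one match at random per df2 row
      ungroup() %>%
      select(-row_id)

The intermediate result of the left_join is what blows up when df1 has many matching rows per combination.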
I wondered if anyone had suggestions for a more efficient method? I've previously found that joining with the data.table package is quicker, but I don't really have a good understanding of the package. I also wonder whether I should be doing a join at all, or should just use the sample function directly?
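For instance, would something along these lines (a hypothetical base-R sketch, assuming every combination in df2 also appears in df1) be a better direction than the join?

    # Look up the df1 row indices for each characteristic combination,
    # then sample one index per df2 row.
    key1 <- paste(df1$gender, df1$year, df1$code)
    key2 <- paste(df2$gender, df2$year, df2$code)
    groups <- split(seq_len(nrow(df1)), key1)
    picked <- vapply(groups[key2], function(idx) {
      if (length(idx) == 1L) idx else sample(idx, 1L)   # guard: sample(n, 1) on a scalar draws from 1:n
    }, integer(1))
    result <- cbind(df2, df1[picked, c("amount", "status")])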
Any help much appreciated.