How do I match across large data sets using names, gender, race, and a rough estimate of age in R

Question

I have 2 data sets each around 20k rows.

df 1 contains the following information

df2 contains the following

first name | last name | race | sex | age

I would like to join the data sets so that I get a unique ID for each row relevant row in DF 2. The issue is since this data is longitudinal some people are in df2 several times and have multiple ages. for example

John | smith | white | male | 29
John | smith | white | male | 30

whereas df1 contains the following

John | smith | white | male | 1991 | 74b23

ultimately I would like the data to look like this


John | smith | white | male | 29 | 74b23
John | smith | white | male | 30 | 74b23

is there a way to return exact matches on the first name, last name, race, and gender but match on age give or take a year?

Thanks!

Welcome to SO, sd3184! Please provide sample data for both as well as some form of expected output. The gold-standard is to paste the output from `dput(x)` into a code-block, where `x` is a small, representative sample of the frame; it might be just 10 rows and the necessary columns, removing ones we don't need. It is essential that the two frames show some matches; it is helpful if there is at least one row in each that does not match, so that you can demonstrate what should happen with those rows. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info — r2evans, Jul 02 '21 at 13:32

koolmees · Answer 1 · 2021-07-02T15:03:52.943

1

library(data.table)

df1 <- data.table(f.name = "John",
                      l.name = "Smith",
                      race = "white",
                      gender = "male",
                      b.year = 1991,
                      ID = "74b23")    

df2 <- data.table(f.name = "John",
                      l.name = "Smith",
                      race = "white",
                      gender = "male",
                      age = c(29, 30))

df1[, age := year(Sys.Date())-b.year]

setkeyv(df1, c("f.name", "l.name", "race", "gender", "age"))
setkeyv(df2, c("f.name", "l.name", "race", "gender", "age"))
df3 <- df1[df2, roll = "nearest"]

The "roll" value uses the last column provided in setkeyv, in this case age.

> df3
   f.name l.name  race gender b.year    ID age
1:   John  Smith white   male   1991 74b23  29
2:   John  Smith white   male   1991 74b23  30

edited Jul 02 '21 at 15:03

answered Jul 02 '21 at 13:38

koolmees

2,725
9
23

Thanks, for some reason this removes one of the John Smith's. I only, for example, see 1: John Smith white male 1991 74b23 29 and not the version where John smith is 30 – sd3184 Jul 02 '21 at 14:23
To clarify, I want to maintain all of the names in df2. – sd3184 Jul 02 '21 at 14:25
@sd3184 What is your tolerance threshold for "nearness" in age? This could be achieved using a [non-equi `join`](https://www.r-bloggers.com/2021/02/the-unequalled-joy-of-non-equi-joins/#non-equi-joins-with-data.table) with `data.table`, in which the `abs()` of the difference in `age`s is less than (or equal to) that threshold... – Greg Jul 02 '21 at 14:31
Did you copy the entire code? As this is not expected behavior. You can try with roll = -1 but nearest works better in your example in my opinion. Using data.table joins should never remove rows unless you specify nomatch = 0 – koolmees Jul 02 '21 at 14:40
Oh I see age in df1 contains 2 values. I'm not familiar with that. Currently the data is formatted so that John smith has a row for each age – sd3184 Jul 02 '21 at 15:01
Ohhhhh, I switched the names around. My bad. It should work if you reverse the names, I'll edit my post – koolmees Jul 02 '21 at 15:02

How do I match across large data sets using names, gender, race, and a rough estimate of age in R

1 Answers1