3

This is very similar to the question @DavidArenburg asked about conditional keyed joins, with an additional bugbear that I can't seem to suss out.

Basically, in addition to a conditional join, I want to define a flag saying at which step of the matching process that the match occurred; my problem is that I can only get the flag to define for all values, not the matched values.

Here's what I hope is a minimal working example:

DT = data.table(
  name = c("Joe", "Joe", "Jim", "Carol", "Joe",
           "Carol", "Ann", "Ann", "Beth", "Joe", "Joe"),
  surname = c("Smith", "Smith", "Jones",
              "Clymer", "Smith", "Klein", "Cotter",
              "Cotter", "Brown", "Smith", "Smith"),
  maiden_name = c("", "", "", "", "", "Clymer",
                  "", "", "", "", ""),
  id = c(1, 1:3, rep(NA, 7)),
  year = rep(1:4, c(4, 3, 2, 2)),
  flag1 = NA, flag2 = NA, key = "year"
)

DT
#      name surname maiden_name id year flag1 flag2
#  1:   Joe   Smith              1    1 FALSE FALSE
#  2:   Joe   Smith              1    1 FALSE FALSE
#  3:   Jim   Jones              2    1 FALSE FALSE
#  4: Carol  Clymer              3    1 FALSE FALSE
#  5:   Joe   Smith             NA    2 FALSE FALSE
#  6: Carol   Klein      Clymer NA    2 FALSE FALSE
#  7:   Ann  Cotter             NA    2 FALSE FALSE
#  8:   Ann  Cotter             NA    3 FALSE FALSE
#  9:  Beth   Brown             NA    3 FALSE FALSE
# 10:   Joe   Smith             NA    4 FALSE FALSE
# 11:   Joe   Smith             NA    4 FALSE FALSE

My approach is, for each year, to first try and match on first name/last name from a prior year; if that fails, then try to match on first name/maiden name. I want to define flag1 to denote an exact match and flag2 to denote a marriage.

for (yr in 2:4) {

  #which ids have we hit so far?
  existing_ids = DT[.(yr), unique(id)]

  #find people in prior years appearing to
  #  correspond to those people
  unmatched = 
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)

  #merge a la Arun, define flag1
  setkey(DT, name, surname)
  DT[year == yr, c("id", "flag1") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)

  #repeat, this time keying on name/maiden_name
  existing_ids = DT[.(yr), unique(id)]
  unmatched = 
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N],by=id]
  setkey(unmatched, name, surname)

  #now define flag2 = TRUE
  setkey(DT, name, maiden_name)
  DT[year==yr & is.na(id), c("id", "flag2") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)

  #this is messy, but I'm trying to increment id
  #  for "new" individuals
  setkey(DT, name, surname, maiden_name)
  DT[year == yr & is.na(id),
     id := unique(
       DT[year == yr & is.na(id)], 
       by = c("name", "surname", "maiden_name")
     )[ , count := .I][.SD, count] + DT[ , max(id, na.rm = TRUE)]
     ]

  #re-sort by year at the end    
  setkey(DT, year)    
}

I was hoping that by including the TRUE value in the j argument while I define id, only the matched names (e.g., Joe at the first step) would have their flag updated to TRUE, but this isn't the case--they are all updated:

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2  TRUE  TRUE
#  6: Carol   Klein      Clymer  3    2  TRUE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3  TRUE  TRUE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE

Is there any way to update only the matched rows' flag values? Ideal output is as follows:

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2 FALSE FALSE
#  6: Carol   Klein      Clymer  3    2 FALSE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3 FALSE FALSE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198

2 Answers2

3

I think flags are messy here; better to simply identify the source of the id:

dt[,c("flag1","flag2"):=NULL]

# create name -> id table
namemap <- unique(dt[,.(maiden_name,id,year),keyby=.(name,surname)],by=NULL)

# tag original ids
namemap[!is.na(id),src:="original"]

# carried over from earlier years
namemap[, has_oid := any(!is.na(id)), by=key(namemap)]
namemap[(has_oid),`:=`(
  id  = id[!is.na(id)],
  src = ifelse(is.na(id), "history", src)
),by=.(name,surname)]

# carry over for surname changes on marriage
namemap[maiden_name!="",`:=`(
  id  = namemap[.BY]$id,
  src = "maiden" 
),by=.(name,maiden_name)]

# create new ids where none exists
namemap[is.na(id),`:=`(
  id  = .GRP+max(dt$id,na.rm=TRUE),
  src = "new"
),by=.(name,surname)]

# copy back to the original table
setkey(dt,name,surname,year)
setkey(namemap,name,surname,year)
dt[namemap,`:=`(
  id  = i.id,
  src = src
)]

which gives

     name surname maiden_name id year      src
 1:   Ann  Cotter              4    2      new
 2:   Ann  Cotter              4    3      new
 3:  Beth   Brown              5    3      new
 4: Carol  Clymer              3    1 original
 5: Carol   Klein      Clymer  3    2   maiden
 6:   Jim   Jones              2    1 original
 7:   Joe   Smith              1    1 original
 8:   Joe   Smith              1    1 original
 9:   Joe   Smith              1    2  history
10:   Joe   Smith              1    4  history
11:   Joe   Smith              1    4  history

The original ordering of the data is lost, but it is easy to recover if you want it.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • so basically, we take the results of the merge I was doing and merge it to the original table? – MichaelChirico Jul 10 '15 at 02:41
  • @MichaelChirico I've updated my answer. This is probably what I'd do. No reference to years is necessary, I think. – Frank Jul 10 '15 at 02:53
  • I'm afraid I'm guilty of oversimplifying my working example orz working on something more accurate to what I'm going for now – MichaelChirico Jul 10 '15 at 14:52
  • Updated, much more complicated; still simpler than what I've actually got going on, but I think I have all the necessary nuances now – MichaelChirico Jul 10 '15 at 15:42
  • @MichaelChirico Okay. Updated my answer. Again, messing around with the years is just complicating things. You have a name => id mapping for a subset of observations and need to extend the mapping to all observations. It is not important what year the existing name => id mapping comes from. And it's really better to store that mapping in a separate table (well, that's my strong preference, anyways). – Frank Jul 10 '15 at 16:37
  • Thanks, and thanks for your time, it's nice to see another approach on how to think about this. – MichaelChirico Jul 10 '15 at 17:36
  • @MichaelChirico Okay, final edit now that I've seen the desired output. I hadn't realized that you cared about the additional case of "saw this person before and am carrying over their id" (now labeled "history"). Anyway, it comes down to personal preference, but in my work I'd (obnoxiously) insist that any data I'm working with be structured this way, rather than with multiple numbered flags that can each take on three values. – Frank Jul 10 '15 at 18:01
  • I hear you, but in the grand context of my larger project (hinted at [here](http://stackoverflow.com/questions/29176114/using-name-full-name-and-maiden-name-strings-and-birthdays-to-match-individual)), I want the flags. – MichaelChirico Jul 10 '15 at 18:22
  • @MichaelChirico Oh wow, that is monstrous code there :) It does not convince me that flags are necessary, however. If you need dummy variables (1) they can be constructed at the time of regression and (2) deliberately storing NAs among them is a recipe for a headache. Anyway, good luck with cleaning that data! – Frank Jul 10 '15 at 18:32
  • 1
    Yes, the code has cleaned up a lot since then, but you get the point that it's quite byzantine. String data gives me nightmares... – MichaelChirico Jul 10 '15 at 18:36
0

The key (no pun intended) I think was to realize that the merge was returning NA for the missed IDs, so I should add the flag to unmatched at each step, e.g., at step 1:

unmatched <- dt[.(1:(yr - 1L))
                ][!id %in% existing_ids,
                  .SD[.N], by = id][ , flag1 := TRUE]
dt[year == yr, c("id", "flag1") := 
     unmatched[.SD, .(id, flag1), on = "name,surname"]]

In the end, this produces:

> dt[ ]
     name surname maiden_name id year flag1 flag2
 1: Carol  Clymer              3    1 FALSE FALSE
 2:   Jim   Jones              2    1 FALSE FALSE
 3:   Joe   Smith              1    1 FALSE FALSE
 4:   Joe   Smith              1    1 FALSE FALSE
 5:   Ann  Cotter              4    2    NA    NA
 6: Carol   Klein      Clymer  3    2    NA  TRUE
 7:   Joe   Smith              1    2  TRUE FALSE
 8:   Ann  Cotter              4    3  TRUE FALSE
 9:  Beth   Brown              5    3    NA    NA
10:   Joe   Smith              1    4  TRUE FALSE
11:   Joe   Smith              1    4  TRUE FALSE

One problem remaining is that some flags that should be F have reset to NA; would be nice to be able to set nomatch=F, but I'm not too worried about this side effect--the key for me is knowing when each flag is T.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198