I am trying to generate a unique ID column using the RecordLinkage package. I have done so successfully with smaller datasets (<= 1,000,000 records), but have not been able to reproduce the result for larger datasets (> 1,000,000 records), which require different (but similar) functions in the package. I am given multiple identifier variables from which I want to generate a unique ID, even though the records may contain errors (near matches) or duplicates.
Given some data frame of identifiers:
library(RecordLinkage)
library(dplyr)
data(RLdata500)
df_identifiers <- RLdata500
This is the code for the smaller datasets (which works):
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- compare.dedup(df_identifiers)
p <- epiWeights(rpairs)
classify <- epiClassify(p, 0.3)
matches <- getPairs(object = classify, show = "links", single.rows = TRUE)
# this code writes an "ID" column that is the same for similar identifiers
matches <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))
df_identifiers$ID_prior <- df_identifiers$ID
# merge matching information with the original data
df_identifiers <- left_join(df_identifiers, matches %>% select(ID.1,ID.2), by=c("ID"="ID.2"))
# replace matches in ID with the thing they match with from ID.1
df_identifiers$ID <- ifelse(is.na(df_identifiers$ID.1), df_identifiers$ID, df_identifiers$ID.1)
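To make the intent of the last few steps concrete, here is a minimal sketch on toy data, using base R only. The records and the link table are made up for illustration (the link table only mimics the ID.1/ID.2 columns used above); none of it is actual RecordLinkage output.

```r
# Toy records with a running ID, as created by mutate(ID = 1:nrow(...)).
records <- data.frame(
  ID   = 1:5,
  name = c("ANNA", "KARL", "ANNE", "JENS", "CARL")
)

# Hypothetical link table: ID.2 was classified as a link to ID.1.
links <- data.frame(ID.1 = c(1, 2), ID.2 = c(3, 5))

# Keep a single link per ID.2 (the one with the smallest ID.1).
links <- links[order(links$ID.1), ]
links <- links[!duplicated(links$ID.2), ]

# Find each record's ID among the ID.2 values; NA means "no link found".
hit <- match(records$ID, links$ID.2)

# Keep the original ID, then overwrite linked records with their ID.1.
records$ID_prior <- records$ID
records$ID <- ifelse(is.na(hit), records$ID, links$ID.1[hit])

print(records)
```

In this sketch records 3 and 5 end up sharing an ID with records 1 and 2 respectively, while record 4 keeps its own. Note that a single lookup like this does not follow chains of links (for example 1-3 and 3-5).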
This approach is discussed here. However, this code does not seem to extend to larger datasets, which require other functions. For example, the big-data equivalent of compare.dedup is RLBigDataDedup, whose RLBigData class supports similar functions such as epiWeights, epiClassify, getPairs, etc. Simply replacing compare.dedup with RLBigDataDedup does not work in this situation.
Consider the following attempt for large datasets:
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- RLBigDataDedup(df_identifiers)
p <- epiWeights(rpairs)
( . . . )
Here, the remaining code is almost identical to that of the first example. Although epiWeights and epiClassify work on the RLBigData class as expected, getPairs does not: for this class it does not accept the show = "links" argument. Because of this, none of the subsequent code works.
Is there a different approach that needs to be taken to generate a column of unique IDs when working with larger datasets in the RLBigData class, or is this just a limitation?