I would like to link two data frames that use different numeric keys, but similar strings. Specifically, one data frame uses a numeric key GVKEY
and a company name CONML
> head(temp.compustat[order(temp.compustat$CONML, decreasing = T), ])
GVKEY CONML
225994 13023 ZZZZ Best Co Inc
211017 11696 Zytrex Corp
213816 11951 Zytec Systems Inc
309886 29163 Zytec Corp
373950 129441 Zynex Inc
383184 145228 ZymoGenetics Inc
> dim(temp.compustat)
[1] 31354 2
The other data frame uses a different numeric key companyid
and a company name company
that may be slightly different than CONML
in the first database.
> head(temp.dealscan[ order(temp.dealscan$company, decreasing = T), ])
companyid company
70473 18192 Zytec Corp
32025 16969 Zynaxis Inc
19714 92271 ZYGO Teraoptix Inc
80473 13185 Zygo Corp
1901 24303 Zycon Corp SDN Bhd
33993 21219 Zycon Corp
> dim(temp.dealscan)
[1] 85818 2
(I am sorting in reverse, because the DealScan databases has blanks and * at the beginning of some entries). It seems that the function RLBigDataLinkage
in package RecordLinkage
is the solution, but I can't get even unsupervised linking to work. Here's my error.
> library(RecordLinkage)
> rpairs <- RLBigDataLinkage(dataset1 = temp.compustat, dataset2 = temp.dealscan, exclude = 1, strcmp = 2, strcmpfun = "levenshtein")
> result <- epiClassify(rpairs, threshold.upper = 0.5)
Error in if (max <= min) stop("must have max > min") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nData1 * nData2 : NAs produced by integer overflow
Coincident with me getting started yesterday, I saw this question about using the Levenshtein distance function manually. Is this approach a better option for me? These data frames are fairly large, so it seems that RLBigData
should be the correct approach (the package authors recommend it for > 1000 entries). Thanks!