RecordLinkage: how to pair only best matches and export a merged table?

Question

I am trying to use the R package RecordLinkage to match items in the purchase orders list with entries in the master catalogue. Below is the R code and a reproducible example using two dummy datasets (DOrders and DCatalogue):

DOrders <- structure(list(Product = structure(c(1L, 2L, 7L, 3L, 4L, 5L, 
6L), .Label = c("31471 - SOFTSILK 2.0 SCREW 7mm x 20mm", "Copier paper white A4 80gsm", 
"High resilience memory foam standard  mattress", "Liston forceps bone cutting 152mm", 
"Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm", "Micro reciprocating blade 39.5 x 7.0 x 0.38", 
"microaire dual tooth 18 x 90 x 0.89"), class = "factor"), Supplier = structure(c(5L, 
6L, 2L, 1L, 4L, 3L, 3L), .Label = c("KAROMED LTD", "Morgan Steer Ortho Limited", 
"ORTHOPAEDIC SOLUTIONS", "SURGICAL HOLDINGS", "T J SMITH NEPHEW LTD", 
"XEROX SOLUTIONS"), class = "factor"), UOI = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L), .Label = c("Each", "Pack"), class = "factor"), 
    Price = c(5.99, 6.99, 40, 230, 35, 80, 79)), .Names = c("Product", 
"Supplier", "UOI", "Price"), class = "data.frame", row.names = c(NA, 
-7L))

DCatalogue <- structure(list(Product = structure(c(7L, 3L, 4L, 5L, 6L, 2L, 
8L, 1L), .Label = c("7.0mm cann canc scr 32x80mm non sterile single use", 
"A4 80gsm white copier paper", "High resilience memory foam standard hospital mattress with stitched seams has a fully enclosing cover", 
"Liston bone cutting forceps with fluted handle straight 152mm", 
"Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm", "Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm", 
"microaire large osc dual tooth 18mm x 90mm x 0.89mm", "Softsilk 2.0 pkg 7x20 ster"
), class = "factor"), Supplier = structure(c(3L, 2L, 6L, 4L, 
4L, 7L, 5L, 1L), .Label = c("BIOMET MERCK LTD", "KAROMED LIMITED", 
"MORGAN STEER ORTHOPAEDICS LTD", "ORTHO SOLUTIONS", "SMITH & NEPHEW ADVANCED SURGICAL DEVICES", 
"SURGICAL HOLDINGS", "XEROX"), class = "factor"), UOI = structure(c(1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("Each", "Pack"), class = "factor"), 
    RefPrice = c(38.7, 274.18, 34.96, 79.48, 81.29, 6.99, 5.99, 
    5)), .Names = c("Product", "Supplier", "UOI", "RefPrice"), class = "data.frame", row.names = c(NA, 
-8L))

For the purpose of experimenting, the DOrders has 7 entries, each of which matches with one of nine rows in the reference set DCatalogue. In the real data, not all orders would match.

head(DOrders)
                                            Product                   Supplier  UOI  Price
1             31471 - SOFTSILK 2.0 SCREW 7mm x 20mm       T J SMITH NEPHEW LTD Each   5.99
2                       Copier paper white A4 80gsm            XEROX SOLUTIONS Each   6.99
3               microaire dual tooth 18 x 90 x 0.89 Morgan Steer Ortho Limited Each  40.00
4    High resilience memory foam standard  mattress                KAROMED LTD Each 230.00
5                 Liston forceps bone cutting 152mm          SURGICAL HOLDINGS Each  35.00
6 Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm      ORTHOPAEDIC SOLUTIONS Each  80.00

> head(DCatalogue)
                                                                                                 Product                      Supplier  UOI RefPrice
1                                                    microaire large osc dual tooth 18mm x 90mm x 0.89mm MORGAN STEER ORTHOPAEDICS LTD Each    38.70
2 High resilience memory foam standard hospital mattress with stitched seams has a fully enclosing cover               KAROMED LIMITED Each   274.18
3                                          Liston bone cutting forceps with fluted handle straight 152mm             SURGICAL HOLDINGS Each    34.96
4                                                      Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm               ORTHO SOLUTIONS Pack    79.48
5                                                      Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm               ORTHO SOLUTIONS Pack    81.29
6                                                                            A4 80gsm white copier paper                         XEROX Each     6.99

The first step in the linkage is to ensure that items match by unit of issue (UOI). This is because a pack of items is obviously not the same as one unit, even if the items are exactly the same. E.g.:

Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm      ORTHOPAEDIC SOLUTIONS Each  80.00

Is the same item, but should be a non-match to:

Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm               ORTHO SOLUTIONS Pack    79.48

Hence, I am using the blocking argument blockfld = 3, to attempt to match only those entries with identical values in the 3rd column. Also, using exclude = 4 , to exclude the price from matching. This will be different between the Orders and the Catalogue and is itself the main interest of matching. The matching is done using jarowinkler string comparator (as described here) on Product and Supplier names:

library(RecordLinkage)

rpairs <- compare.linkage(DOrders, DCatalogue, 
                          blockfld = 3,
                          exclude = 4,
                          strcmp = 1:2,
                          strcmpfun = jarowinkler)

Next, I am computing the weights for each pair using the Contiero et al. (2005) method:

rpairs <- epiWeights(rpairs)
> summary(rpairs)
Weight distribution:

[0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1] 
        1         1        19        10         3         0         4

Based on this distribution, I want to classify as matches only those pairs with a weight > 0.7

result <- epiClassify(rpairs, 0.7)
> summary(result)
7 links detected 
0 possible links detected 
31 non-links detected

This is as far as I got, but there are some problems with this.

First, getPairs(result) shows that one entry from DOrders can have a high weight match with more than one entry in the DCatalogue. E.g.

This pair is correctly matched, with a weight of 0.948

Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS   Pack    79  
Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm   ORTHO SOLUTIONS Pack    81.29   0.9480503

but is also matched incorrectly with a weight of 0.928:

Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS   Pack    79  
Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm   ORTHO SOLUTIONS Pack    79.48   0.9283522

Obviously, I need to restrict pairing to only one best match with the highest weight, but how to do it?

And finally, the end result that I am looking for is a merged dataset that contains matched entries from both Orders and Catalogue in one row, with all columns from both original sets side by side for comparison. getPairs produces an output in an awkward format:

> getPairs(result)
    id  Product Supplier    UOI Price   Weight
1   7   Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS   Pack    79  
2   5   Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm   ORTHO SOLUTIONS Pack    81.29   0.9480503
3                       
4   5   Liston forceps bone cutting 152mm   SURGICAL HOLDINGS   Each    35  
5   3   Liston bone cutting forceps with fluted handle straight 152mm   SURGICAL HOLDINGS   Each    34.96   0.9329244
...

score 5 · Answer 1 · edited May 23 '17 at 11:47

First of all, thank you for providing a reproducable example, which eases answering your questions a lot. I will start with your second question:

And finally, the end result that I am looking for is a merged dataset that contains matched entries from both Orders and Catalogue in one row, with all columns from both original sets side by side for comparison.

With single.rows=TRUE, getPairs lists both entries in one line. Furthermore, show="links" limits the output to pairs classified as belonging together (see ?getPairs for details):

> matchedPairs <- getPairs(result, single.rows=TRUE, show="links")

However, this does not put matching columns next to each other, but all columns of record one are followed by all columns of record two (and finally the matching weight as last column). I show only the column names here as the whole table is really wide:

> names(matchedPairs)
 [1] "id1"        "Product.1"  "Supplier.1" "UOI.1"      "Price.1"    "id2"        "Product.2"  "Supplier.2" "UOI.2"      "RefPrice.2" "Weight"

So if you want direct column-to-column comparison in this format, you have to rearrange the colums to fit your needs.

Obviously, I need to restrict pairing to only one best match with the highest weight, but how to do it?

This functionality is not provided by the package and I believe the process of choosing one-to-one assignments from a record linkage result needs some conceptual attention on its own. I have never gone deeply into this step, so the following might just be an idea to start with. You can use the data.table library to choose from each group of pairs with the same left-hand id the one with maximum weight (compare How to select the row with the maximum value in each group):

> library(data.table)
> matchedPairs <- data.table(matchedPairs)
> matchedPairs[matchedPairs[,.I[which.max(Weight)],by=id1]$V1, list(id1,id2)]
   id1 id2
1:   7   5
2:   5   3
3:   4   2
4:   2   6
5:   6   1
6:   3   1

Here, list(id1,id2) limits the output to the record ids.

In order to eliminate also double mappings for the right hand ids (in this case, 1 appears twice for id2) you would have to repeat the process for id2. Note however, that in some situations choosing pairs with highest weight in step 1 (reducing to unique values for id1) may remove pairs where the weight is maximal for a given value of id2. So for choosing an optimal overall mapping (e.g. maximizing the sum of weights of all chosen mappings) a non-greedy optimization strategy is needed.

Update: Using classes and methods for big data sets

For large datasets, the so called "big data" classes and methods can be used (see https://cran.r-project.org/web/packages/RecordLinkage/vignettes/BigData.pdf). These use file-backed data structures, so the size limit is the available disk space. The syntax is mostly, but not completely identical. For this example, the necessary calls to achieve the same result as above would be:

rpairs <- RLBigDataLinkage(DOrders, DCatalogue, 
                      blockfld = 3,
                      exclude = 4,
                      strcmp = 1:2,
                      strcmpfun = "jarowinkler")

rpairs <- epiWeights(rpairs)
result <- epiClassify(rpairs, 0.7)
matchedPairs <- getPairs(result, single.rows=TRUE, filter.link="link")
matchedPairs <- data.table(matchedPairs)
matchedPairs[matchedPairs[,.I[which.max(Weight)],by=id.1]$V1, list(id.1,id.2)]

However, concerning your size estimation of 2 TB, this is still not feasible. I think you have to further reduce the number of pairs by additional blocking.

The problem in this case is that the package only supports "hard" blocking criteria (i.e. two records must match exactly in the blocking field). When linking personal data (which was our use case while developing the package), the day, month and year components of the date of birth can usually be combined for blocking in such a way that the number of pairs is significantly reduced without missing match candidates. As far as I can judge from the examples, further "hard" blocking is not possible for your data, as the matching pairs have only similar, but not equal attribute values (aside from the "unit of issue", which you already use for blocking). A criterium like "only consider pairs where the string similarity of the product names is greater than [some threshold]" seems most appropriate to me. To achieve this, you would have to extend compare.linkage() or RLBigDataLinkage().

thank you for your very helpful answer. One note though, with my specific aim I only need to remove double mappings in id1, not id2 (i.e. an order can match only one entry in a master reference catalogue, but an entry in the catalogue can be ordered multiple times). — Mihail, Nov 04 '16 at 13:10
Also, unfortunately, while the code works perfectly on the experimental toy datasets, it does not work at all on the real datasets. This is because a typical list of orders has c. 150,000 entries and the master catalogue 400,000 entries. R simply does not have enough memory to create the pairs object, which would need to have c. 30 trillion pairs. My ballpark estimates suggest that such an object would require in the region of 2 Tb of memory. Unless anyone has any ideas, it seems that the linking of real data would need a whole different approach. — Mihail, Nov 04 '16 at 13:14

RecordLinkage: how to pair only best matches and export a merged table?

1 Answers1

Update: Using classes and methods for big data sets