I am trying to use the R package RecordLinkage
to match items in the purchase orders list with entries in the master catalogue. Below is the R code and a reproducible example using two dummy datasets (DOrders and DCatalogue):
DOrders <- structure(list(Product = structure(c(1L, 2L, 7L, 3L, 4L, 5L,
6L), .Label = c("31471 - SOFTSILK 2.0 SCREW 7mm x 20mm", "Copier paper white A4 80gsm",
"High resilience memory foam standard mattress", "Liston forceps bone cutting 152mm",
"Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm", "Micro reciprocating blade 39.5 x 7.0 x 0.38",
"microaire dual tooth 18 x 90 x 0.89"), class = "factor"), Supplier = structure(c(5L,
6L, 2L, 1L, 4L, 3L, 3L), .Label = c("KAROMED LTD", "Morgan Steer Ortho Limited",
"ORTHOPAEDIC SOLUTIONS", "SURGICAL HOLDINGS", "T J SMITH NEPHEW LTD",
"XEROX SOLUTIONS"), class = "factor"), UOI = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L), .Label = c("Each", "Pack"), class = "factor"),
Price = c(5.99, 6.99, 40, 230, 35, 80, 79)), .Names = c("Product",
"Supplier", "UOI", "Price"), class = "data.frame", row.names = c(NA,
-7L))
DCatalogue <- structure(list(Product = structure(c(7L, 3L, 4L, 5L, 6L, 2L,
8L, 1L), .Label = c("7.0mm cann canc scr 32x80mm non sterile single use",
"A4 80gsm white copier paper", "High resilience memory foam standard hospital mattress with stitched seams has a fully enclosing cover",
"Liston bone cutting forceps with fluted handle straight 152mm",
"Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm", "Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm",
"microaire large osc dual tooth 18mm x 90mm x 0.89mm", "Softsilk 2.0 pkg 7x20 ster"
), class = "factor"), Supplier = structure(c(3L, 2L, 6L, 4L,
4L, 7L, 5L, 1L), .Label = c("BIOMET MERCK LTD", "KAROMED LIMITED",
"MORGAN STEER ORTHOPAEDICS LTD", "ORTHO SOLUTIONS", "SMITH & NEPHEW ADVANCED SURGICAL DEVICES",
"SURGICAL HOLDINGS", "XEROX"), class = "factor"), UOI = structure(c(1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("Each", "Pack"), class = "factor"),
RefPrice = c(38.7, 274.18, 34.96, 79.48, 81.29, 6.99, 5.99,
5)), .Names = c("Product", "Supplier", "UOI", "RefPrice"), class = "data.frame", row.names = c(NA,
-8L))
For the purpose of experimenting, the DOrders has 7 entries, each of which matches with one of nine rows in the reference set DCatalogue. In the real data, not all orders would match.
head(DOrders)
Product Supplier UOI Price
1 31471 - SOFTSILK 2.0 SCREW 7mm x 20mm T J SMITH NEPHEW LTD Each 5.99
2 Copier paper white A4 80gsm XEROX SOLUTIONS Each 6.99
3 microaire dual tooth 18 x 90 x 0.89 Morgan Steer Ortho Limited Each 40.00
4 High resilience memory foam standard mattress KAROMED LTD Each 230.00
5 Liston forceps bone cutting 152mm SURGICAL HOLDINGS Each 35.00
6 Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHOPAEDIC SOLUTIONS Each 80.00
> head(DCatalogue)
Product Supplier UOI RefPrice
1 microaire large osc dual tooth 18mm x 90mm x 0.89mm MORGAN STEER ORTHOPAEDICS LTD Each 38.70
2 High resilience memory foam standard hospital mattress with stitched seams has a fully enclosing cover KAROMED LIMITED Each 274.18
3 Liston bone cutting forceps with fluted handle straight 152mm SURGICAL HOLDINGS Each 34.96
4 Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHO SOLUTIONS Pack 79.48
5 Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm ORTHO SOLUTIONS Pack 81.29
6 A4 80gsm white copier paper XEROX Each 6.99
The first step in the linkage is to ensure that items match by unit of issue (UOI). This is because a pack of items is obviously not the same as one unit, even if the items are exactly the same. E.g.:
Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHOPAEDIC SOLUTIONS Each 80.00
Is the same item, but should be a non-match to:
Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHO SOLUTIONS Pack 79.48
Hence, I am using the blocking argument blockfld = 3
, to attempt to match only those entries with identical values in the 3rd column. Also, using exclude = 4
, to exclude the price from matching. This will be different between the Orders and the Catalogue and is itself the main interest of matching. The matching is done using jarowinkler
string comparator (as described here) on Product and Supplier names:
library(RecordLinkage)
rpairs <- compare.linkage(DOrders, DCatalogue,
blockfld = 3,
exclude = 4,
strcmp = 1:2,
strcmpfun = jarowinkler)
Next, I am computing the weights for each pair using the Contiero et al. (2005) method:
rpairs <- epiWeights(rpairs)
> summary(rpairs)
Weight distribution:
[0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
1 1 19 10 3 0 4
Based on this distribution, I want to classify as matches only those pairs with a weight > 0.7
result <- epiClassify(rpairs, 0.7)
> summary(result)
7 links detected
0 possible links detected
31 non-links detected
This is as far as I got, but there are some problems with this.
First, getPairs(result)
shows that one entry from DOrders can have a high weight match with more than one entry in the DCatalogue. E.g.
This pair is correctly matched, with a weight of 0.948
Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS Pack 79
Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm ORTHO SOLUTIONS Pack 81.29 0.9480503
but is also matched incorrectly with a weight of 0.928:
Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS Pack 79
Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHO SOLUTIONS Pack 79.48 0.9283522
Obviously, I need to restrict pairing to only one best match with the highest weight, but how to do it?
And finally, the end result that I am looking for is a merged dataset that contains matched entries from both Orders and Catalogue in one row, with all columns from both original sets side by side for comparison. getPairs
produces an output in an awkward format:
> getPairs(result)
id Product Supplier UOI Price Weight
1 7 Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS Pack 79
2 5 Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm ORTHO SOLUTIONS Pack 81.29 0.9480503
3
4 5 Liston forceps bone cutting 152mm SURGICAL HOLDINGS Each 35
5 3 Liston bone cutting forceps with fluted handle straight 152mm SURGICAL HOLDINGS Each 34.96 0.9329244
...