
I have been learning to use the R RecordLinkage package recently. On a very small example linking 2 datasets, one with 8 rows and the other with 11, I get the following results:

Linkage Data Set

8 records in data set 1 
11 records in data set 2 
8 record pairs 

4 matches
4 non-matches
0 pairs with unknown status


Weight distribution:

[0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1] 
        2         0         2         0         1         3 

3 links detected 
0 possible links detected 
5 non-links detected 

alpha error: 0.250000
beta error: 0.000000
accuracy: 0.875000


Classification table:

           classification
true status N P L
      FALSE 4 0 0
      TRUE  1 0 3

What I am failing to understand is how the alpha error, beta error and accuracy relate to the classification table. Where exactly do the following figures come from, and how are they calculated?

alpha error: 0.250000
beta error: 0.000000
accuracy: 0.875000

Any help greatly appreciated


1 Answer


Alpha and beta error are statistical measures, more commonly known as type I and type II error, respectively. In statistical terms, the alpha error is the probability of rejecting the null hypothesis given that it is true; the beta error is the probability of accepting the null hypothesis given that it is false (compare, for example, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996198/).

In the case of record linkage, the null hypothesis is that a record pair is a match, i.e. the two records represent the same entity. Thus, the alpha error is the probability of labelling a pair as non-match given that it is really a match (false negative). This error is calculated as: (number of matches classified as 'non-link') / (number of matches).[1] In the above example, there are 4 matches, of which 1 is not recognized, thus, the alpha error is 1 / 4 = 0.25.
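
To make the arithmetic concrete, here is a minimal R sketch that rebuilds the classification table by hand and computes the alpha error from it (the matrix ct is my own stand-in for the printed table, not an object taken from the RecordLinkage API):

    # Classification table from the question, rebuilt by hand:
    # rows = true status (FALSE/TRUE), columns = classification (N, P, L)
    ct <- matrix(c(4, 0, 0,
                   1, 0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("FALSE", "TRUE"), c("N", "P", "L")))

    # alpha error: matches classified as non-link, divided by all matches
    ct["TRUE", "N"] / sum(ct["TRUE", ])  # 1 / 4 = 0.25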

Similarly, beta error is the probability of classifying a pair as match given that it is really a non-match (false positive). It is calculated as (number of non-matches classified as 'link') / (number of non-matches). In the above example, there is no false positive classification, so the beta error is 0. Let's assume a different classification table:

           classification
true status N P L
      FALSE 2 0 2
      TRUE  1 0 3

In this case, there are 4 non-matches, of which 2 are falsely classified as links, so the beta error is 2 / 4 = 0.5.
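
The same kind of hand-rolled sketch for this hypothetical table (again, ct2 is my own construction, not RecordLinkage output):

    # Hypothetical classification table from above
    ct2 <- matrix(c(2, 0, 2,
                    1, 0, 3),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("FALSE", "TRUE"), c("N", "P", "L")))

    # beta error: non-matches classified as link, divided by all non-matches
    ct2["FALSE", "L"] / sum(ct2["FALSE", ])  # 2 / 4 = 0.5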

Finally, accuracy is just the proportion of correct classifications among all pairs (see https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers#Single_metrics). In the classification table from the question, there are 7 correct classifications (4 non-matches, 3 matches), so accuracy is 7 / 8 = 0.875.
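
As a final sketch in the same style, accuracy from the question's table (ct rebuilt by hand as before):

    ct <- matrix(c(4, 0, 0,
                   1, 0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("FALSE", "TRUE"), c("N", "P", "L")))

    # accuracy: correct classifications (FALSE -> N, TRUE -> L) over all pairs
    (ct["FALSE", "N"] + ct["TRUE", "L"]) / sum(ct)  # (4 + 3) / 8 = 0.875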

[1] I use '(non-)link' instead of '(non-)match' when I mean the outcome of the classification algorithm in contrast to the real status.
