
When clustering, I receive the following warning:

UserWarning: A component contained 77760 elements. 
Components larger than 30000 are re-filtered. 
The threshold for this filtering is 4.08109134074e-15

What does this mean?

My original threshold specification was 0.191, as shown below:

clustered_dupes = deduper.match(data,threshold=0.191)
Rtab

1 Answer


The threshold is for the cophenetic similarity of a cluster, not pairwise similarity.

fgregg
So I get that the threshold referred to in the error is not the pairwise threshold, but can the cause and any general tips for handling this error be shared? What might be best to alter in the various settings (recall, threshold, maybe more fields to use)? Is the cause simply that too many records are being suggested as matches (likely too simplistic a statement)? I have a group of about 5 million records of people, and I'm trying to match them using name, phone, birth date, and address, and I often see this error before my server reboots. – da Bich Aug 16 '23 at 19:50
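
To make the warning text a bit more concrete, here is a rough sketch of what "re-filtering" an oversized component could look like. It is not dedupe's actual code: the function names `connected_components` and `refilter`, the `(a, b, score)` edge format, and the rule of doubling the weakest score to get the new cut-off are all assumptions made for illustration. The idea is only that candidate pairs form a graph, that chains of weak links can join many records into one huge component, and that dropping the weakest links splits it up again.

```python
from collections import defaultdict


def connected_components(edges):
    """Group scored edges (a, b, score) into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b, _ in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for edge in edges:
        groups[find(edge[0])].append(edge)
    return list(groups.values())


def refilter(edges, max_size):
    """Yield components with at most max_size records, re-filtering larger ones."""
    for component in connected_components(edges):
        nodes = {n for a, b, _ in component for n in (a, b)}
        if len(nodes) <= max_size:
            yield component
            continue
        # Assumption for this sketch: the new cut-off is twice the weakest score.
        threshold = 2 * min(score for _, _, score in component)
        kept = [edge for edge in component if edge[2] > threshold]
        yield from refilter(kept, max_size)


# Hypothetical data: one very weak link (b, c) chains two pairs together.
edges = [("a", "b", 0.9), ("b", "c", 0.01), ("c", "d", 0.8), ("e", "f", 0.7)]
for component in refilter(edges, max_size=3):
    print(component)
```

In this toy data the four-record component exceeds `max_size`, so the weak `("b", "c", 0.01)` link is dropped and the component splits in two. If something similar is happening in your data, a component of 77760 records may be held together by very low-scoring pairs, so tighter blocking or more discriminating fields might reduce how many such pairs are generated in the first place.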