I'm trying to find an algorithm for checking similarities between two data entries. Say I have two data structures (fields in a contact list) with the following data:

// UserA addressbook.
name: Frank Sinatra
mobile: +44 555 555 555 55

// UserB addressbook.
name: Frank Albert Sinatra
phone: 004455555555555

I got those entries from different providers: UserA synced his Google account, while UserB synced his Microsoft account. I want my algorithm to tell me that both users know the same guy (within some probability).

Does anyone know where I should look? I've tried to find a hashing algorithm that creates "unsafe" hashes, i.e. similar hashes for similar data, but that route wasn't productive.

David Sergey
  • For starters, you could isolate all names (first name, surname, middle name) into an array, and strip all "+" signs, spaces and leading zeroes from the number. Then check if the numbers match and if one of the arrays contains some of the elements of the other one. –  Nov 29 '13 at 11:45
  • The data structures are just examples. It might be custom fields, or a list of messages. I need to compare two data structures. – David Sergey Nov 29 '13 at 12:02
  • Hmmm. I don't think a general similarity algorithm will work in your example case. Two similar phone numbers are essentially different. "Miller R", "Robert Miller" and "Miller Bob" might refer to the same person and are lexically different, although they share a common sub-word. I think that you could be more successful if you normalised the data somehow, as H2CO3 suggested, and then used a custom comparison for each field telling you whether A _might_ be B, such that e.g. "J Rye" < "Jane Rye" < "Jane F E Rye" and "Rye, J" == "J Rye". – M Oehm Nov 29 '13 at 12:58
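
The normalisation suggested in the comments above could be sketched roughly as follows. This is only an illustration under simple assumptions: the function names (`normalize_phone`, `name_tokens`, `might_be_same`) are hypothetical, stripping leading zeroes is a crude stand-in for real international-prefix handling, and "some shared name token" is just one possible matching rule.

```python
import re


def normalize_phone(raw: str) -> str:
    """Keep digits only (drops "+", spaces, dashes), then strip leading zeroes."""
    digits = re.sub(r"\D", "", raw)
    return digits.lstrip("0")


def name_tokens(raw: str) -> set[str]:
    """Split a name into lower-cased word tokens, ignoring punctuation and order."""
    return {token.lower() for token in re.findall(r"\w+", raw)}


def might_be_same(name_a: str, phone_a: str, name_b: str, phone_b: str) -> bool:
    """Assumed rule: normalised numbers must match and the names must share a token."""
    number_a = normalize_phone(phone_a)
    number_b = normalize_phone(phone_b)
    if not number_a or number_a != number_b:
        return False
    return bool(name_tokens(name_a) & name_tokens(name_b))


# The two address book entries from the question.
print(might_be_same("Frank Sinatra", "+44 555 555 555 55",
                    "Frank Albert Sinatra", "004455555555555"))
```

Because the names are compared as token sets, "Rye, J" and "J Rye" come out identical, which matches the behaviour asked for in the last comment. Nicknames such as "Bob" for "Robert" would still need a separate lookup.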

2 Answers

The similarity of strings can be determined with the Levenshtein distance. The strings should be prepared before the test, e.g. by removing special characters or splitting the string. For data structures, have a look at "How do you measure similarity between 2 series of data?"
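
As a rough illustration of that idea, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming approach. The `similarity` helper, which scales the distance into the range 0 to 1 by dividing by the longer string's length, is an assumption made for this example and only one of several possible normalisations.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("frank sinatra", "frank albert sinatra"))
print(similarity("frank sinatra", "frank albert sinatra"))
```

This fits free-text fields such as names better than phone numbers, where a single differing digit usually means a different subscriber.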

Daniel
  • And the Levenshtein distance of two telephone numbers tells you what? Two numbers with a distance of 1 might even be in another country, depending on where the difference is. – M Oehm Nov 29 '13 at 13:01

Some keywords you could look into further: data similarity, distance/similarity measures (metrics), correlation, inexact matching.

ile