
I am developing a Python application that at its core performs the following task:

Internally, have a list of items, say, A. Go through another list of items, say B, and for every item b in list B, find the corresponding item a in A. There should always be such an item - if there is no match, the program shall fail.

Naturally, it seems that this would be a perfect application for a binary search over data sorted by a hash key.

The problem is that the items in the lists are complicated objects. For now, assume that each entry a or b is in itself a list of approximately 10 vectors (sometimes more, sometimes less) with such shape:

vector = [ id, status, px, py, pz ]

where id and status are integer values, and px, py and pz are floating point values, such that an item might look like this:

aExample = [ [   2, -1,  0.5,  0.7,  0.9 ], 
             [  -1, -1, -0.4, -0.6, -0.8 ],
             [  25,  2,  1.1,  1.3, -1.7 ],
             [  24,  2,  1.2,  1.1,  1.6 ],
             [ -24,  2,  0.9,  0.8,  2.1 ],
             [  11,  1,  1.2,  1.3,  2.6 ],
             [ -11,  1,  1.4,  1.2,  2.4 ],
             [  13,  1,  1.8,  1.6,  2.1 ],
             [ -11,  1,  3.2,  0.1,  3.6 ] ]

The list A holds a couple hundred thousand such entries; the list B holds a few tens of thousands.

To add further complications:

  • A match requires the number of vectors to be the same, but not their ordering
  • A match requires all id and status values to match
  • A match does not require perfect agreement of the floating point values px, py, pz, but requires approximate agreement within some permille-level tolerance in order to be unique.

Now, as you might imagine, I'm having a hard time coming up with a data structure that will be efficient and maintainable.

The only feasible way at this point seems to be to first categorize my data by the values that require hard matches, and then perform a standard search for the "best soft match" within each category. But I'm curious: are there any hashing algorithms that support this sort of hybrid approach, requiring some parts of the data to match exactly and others only approximately in order to produce a similar hash? Or do you have any other suggestion for how to tackle this problem efficiently?
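
To make this more concrete, here is a rough sketch of what I currently have in mind; the helper names and the tolerance value are placeholders, and the greedy pairing in `items_match` is only a first guess:

import math
from collections import defaultdict

REL_TOL = 1e-3   # permille-level tolerance, just a placeholder value

def hard_key(item):
    # Exact-match part: the multiset of (id, status) pairs,
    # made order-independent by sorting.
    return tuple(sorted((v[0], v[1]) for v in item))

def items_match(a, b, rel_tol=REL_TOL):
    # Soft-match part: try to pair every vector in b with a still
    # unused vector in a that has the same id/status and momenta
    # within the relative tolerance (greedy pairing, which could in
    # principle fail for pathological near-degenerate cases).
    if len(a) != len(b):
        return False
    unused = list(a)
    for vb in b:
        for i, va in enumerate(unused):
            if (va[0] == vb[0] and va[1] == vb[1] and
                    all(math.isclose(x, y, rel_tol=rel_tol)
                        for x, y in zip(va[2:], vb[2:]))):
                del unused[i]
                break
        else:
            return False
    return True

def find_matches(A, B):
    # Categorize A once by the hard key ...
    buckets = defaultdict(list)
    for a in A:
        buckets[hard_key(a)].append(a)
    # ... then for every b, soft-search only within its own category.
    result = []
    for b in B:
        candidates = buckets.get(hard_key(b), [])
        matches = [a for a in candidates if items_match(a, b)]
        if not matches:
            raise LookupError("no matching item found in A")
        result.append(matches[0])
    return result

But within the large categories this still amounts to a linear scan, which is exactly the part I would hope to avoid.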

carsten
  • Any hashing algorithm that hashes soft matches into the same bucket would qualify, no? – thebjorn Oct 10 '15 at 12:48
  • 1
    Suppose you were to make a hash based only on `id` and `status` (e.g. by doing the sum of `id ^ status`). Would that lead to lots of collisions? If not, you could (1) find all the matches based on `id` and `status` and the number of vectors, and then (2) find the closest match based on the floating point values. Obviously this will only be practical if stage (1) reduces the number of possibilities to a very small number. – Paul Boddington Oct 10 '15 at 12:53
  • When only taking into account `id` and `status`, this categorizes the data in approximately 2000 categories, which means that there are on average still 50-100 collisions. Also, I have reason to believe that the entries will be very unevenly distributed in these categories, with half a dozen of them making up for more than 50% of the entries (rough estimate). – carsten Oct 10 '15 at 12:57
  • 1
    Hmmm, then you have a very tricky problem on your hands. I think you need two hashes, the first based on `id` and `status` and the second based on rounded versions of the floating point values. This could lead to 2 very close values being given a different hash, so you may need to search several buckets. E.g. for the number 7.49 you would have to search the buckets corresponding to both the rounded versions `7` and `8`. I've seen this problem before, and AFAIK there is no simple satisfactory solution. – Paul Boddington Oct 10 '15 at 13:02
  • While it's sad to hear that there's no simple solution, it's satisfactory to hear that I'm not just being stupid. But even in this case, I'm wondering what hashing algorithms would be appropriate? For the floating point values, I could just use the product, since that will be stable with respect to decimal places (relative difference in inputs proportional to relative difference in outputs). But for the integers, I'm wondering what hashing would make sense. Any suggestions? – carsten Oct 10 '15 at 13:10
  • @carsten This is a very interesting question. I will think about it and try to come up with a reasonable hash. Don't be surprised if this takes me several hours... – Paul Boddington Oct 10 '15 at 13:14
  • Locality sensitive hashing maybe, but what's the problem with a dictionary of data structures supporting 3D proximity queries? – David Eisenstat Oct 10 '15 at 13:23
  • @DavidEisenstat To me, the main complexity seems to come from the fact that it's unordered. It's only clear that `[[1, 1, 2.0, 3.0, 4.0], [1, 1, 5.0, 6.0, 7.0]]` is close to `[[1, 1, 5.0, 6.0, 7.1], [1, 1, 2.0, 3.1, 4.0]]` if you know you can swap the items. The question is about proximity in some kind of 3D space that is invariant under a cartesian product of symmetric groups, and I'm having difficulty getting my head around it. – Paul Boddington Oct 10 '15 at 14:18
  • Or maybe I'm just overthinking it! – Paul Boddington Oct 10 '15 at 14:20
    I think I've come up with an implementation for the "hard" part of the hash that does what I would like it to do, but I have no idea what performance to expect from this: `frozenset({ k:tuple(sorted([ v[0] for v in aExample if v[1] == k]) ) for k in set([ s[1] for s in aExample ]) }.items())` – carsten Oct 10 '15 at 14:25

1 Answer


I would approach it in the following way:

  • define a Vector class with attributes id, status, px, py, pz
  • the Vector class also has to define hash and equality methods along these lines:

    def __eq__(self, other):
        return (self.id == other.id and self.status == other.status and
                bucketOf(self.px, self.py, self.pz) == bucketOf(other.px, other.py, other.pz))

    def __hash__(self):
        return hash(self.id + self.status + bucketOf(self.px, self.py, self.pz))

    Please note that hash stands for some hashing function with a low collision rate, + is not a mathematical plus but merely indicates what gets included in the hash calculation, and bucketOf maps px, py and pz to a single number/id identifying the range of values you consider close enough within your tolerances.

Since a hash cannot guarantee zero collisions, you need a dictionary mapping each hash value to the list of Vectors that share it.

Then, to find the match for a given Vector instance, you calculate its hash, look up the list of Vectors with the same hash value, and filter that list for elements where the equality method returns true.

This way the criterion of equality and the fast retrieval of instances depend solely on the hash and equality methods defined by the Vector class.

That's how it's done for HashMaps in Java.
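
In Python terms, a minimal sketch of this idea might look as follows; here bucketOf is reduced to simple rounding, the bucket width is only a placeholder, and (as noted in the comments above) two values falling on opposite sides of a bucket boundary would still need extra handling:

from collections import defaultdict

BUCKET = 0.001   # placeholder bucket width, roughly permille-level

def bucket_of(px, py, pz):
    # Map the momenta onto a coarse grid; vectors whose components
    # fall into the same grid cell count as "close enough".
    return (round(px / BUCKET), round(py / BUCKET), round(pz / BUCKET))

class Vector:
    def __init__(self, id, status, px, py, pz):
        self.id = id
        self.status = status
        self.px, self.py, self.pz = px, py, pz

    def __eq__(self, other):
        return (self.id == other.id and
                self.status == other.status and
                bucket_of(self.px, self.py, self.pz) ==
                bucket_of(other.px, other.py, other.pz))

    def __hash__(self):
        # Hash exactly what __eq__ compares, so equal Vectors hash equally.
        return hash((self.id, self.status,
                     bucket_of(self.px, self.py, self.pz)))

# Dictionary from hash value to the list of Vectors sharing that hash.
index = defaultdict(list)

def add(vector):
    index[hash(vector)].append(vector)

def find(vector):
    # Candidates share the hash; __eq__ then picks the true matches.
    return [v for v in index[hash(vector)] if v == vector]

(In Python you could also put the Vector objects directly into a set or dict, since those already resolve hash collisions internally once `__hash__` and `__eq__` are defined.)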

Please see this answer explaining the data structure in more detail and the contract imposed on the hash and equality methods.

My disclaimer: I am mostly a Java dev, so perhaps certain suggestions here are not easily transferable to Python.

diginoise