I am developing a Python application whose core task is the following:
Internally, have a list of items, say `A`. Go through another list of items, say `B`, and for every item `b` in `B`, find the corresponding item `a` in `A`. There should always be such an item - if there is no match, the program shall fail.
Naturally, this seems like a perfect use case for a hash-based lookup, or a binary search in a sorted data structure.
The problem is that the items in the lists are complicated objects. For now, assume that each entry `a` or `b` is itself a list of approximately 10 vectors (sometimes more, sometimes fewer) of the shape

```python
vector = [ id, status, px, py, pz ]
```

where `id` and `status` are integer values, and `px`, `py`, and `pz` are floating-point values, so that an item might look like this:
```python
aExample = [ [   2, -1,  0.5,  0.7,  0.9 ],
             [  -1, -1, -0.4, -0.6, -0.8 ],
             [  25,  2,  1.1,  1.3, -1.7 ],
             [  24,  2,  1.2,  1.1,  1.6 ],
             [ -24,  2,  0.9,  0.8,  2.1 ],
             [  11,  1,  1.2,  1.3,  2.6 ],
             [ -11,  1,  1.4,  1.2,  2.4 ],
             [  13,  1,  1.8,  1.6,  2.1 ],
             [ -11,  1,  3.2,  0.1,  3.6 ] ]
```
The list `A` stores a couple of hundred thousand such entries; the list `B` stores a few tens of thousands.
To add further complications:

- A match requires the number of vectors to be the same, but not their ordering.
- A match requires all `id` and `status` values to match.
- A match does not require perfect agreement of the floating-point values `px`, `py`, `pz`, but it does require approximate agreement within some permille-level tolerance in order to be unique.
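
To make these rules concrete, here is a sketch of what I imagine the comparison itself could look like (the function names and the exact tolerance values are placeholders I made up, not settled choices):

```python
import math

def vectors_match(v1, v2, rel_tol=1e-3):
    # Exact match on the integers id and status ...
    if v1[0] != v2[0] or v1[1] != v2[1]:
        return False
    # ... approximate (permille-level) match on px, py, pz.
    # abs_tol guards against components that are exactly zero.
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-9)
               for x, y in zip(v1[2:], v2[2:]))

def items_match(a, b, rel_tol=1e-3):
    # Order-insensitive comparison of two items (lists of vectors).
    if len(a) != len(b):
        return False
    unused = list(b)  # every vector in b may be claimed at most once
    for v in a:
        for i, w in enumerate(unused):
            if vectors_match(v, w, rel_tol):
                del unused[i]
                break
        else:
            return False  # v found no partner in b
    return True
```

The greedy pairing in `items_match` assumes the tolerance is tight enough that each vector has at most one plausible partner, which the uniqueness requirement above should guarantee.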
Now, as you might imagine, I'm having a hard time coming up with a data structure that is efficient and maintainable.

The only feasible way I see at this point is to first categorize my data by the values that require hard matches, and then perform a standard search for the best "soft match" within each category (sketched below). But I'm curious: are there any hashing algorithms that support this kind of hybrid approach, where some parts of the data have to match exactly and others only approximately in order to produce the same hash? Or do you have any other suggestions for tackling this problem efficiently?
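
For reference, here is roughly what I mean by "categorize first, then soft-match": a sketch (reusing `items_match` from above; `A` and `B` stand for the lists described earlier) that keys a dictionary on the hard-match part of each item only:

```python
from collections import defaultdict

def hard_key(item):
    # Signature of the exactly-matching part: the multiset of
    # (id, status) pairs, sorted so vector ordering is irrelevant.
    # The tuple length also encodes the vector count.
    return tuple(sorted((v[0], v[1]) for v in item))

# Build the index over A once.
index = defaultdict(list)
for a in A:
    index[hard_key(a)].append(a)

# For each b, only candidates sharing the hard key are scanned.
for b in B:
    candidates = index.get(hard_key(b), [])
    matches = [a for a in candidates if items_match(a, b)]
    if len(matches) != 1:
        raise LookupError("no unique match found for an item of B")
    a_matched = matches[0]
```

This keeps the expensive soft comparison confined to small buckets, but it still feels somewhat ad hoc, hence my question.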