1

I have a list of dictionaries, which are sorted on one property:

[{"name":"efi", "sortedProp":"01df"},
 {"name":"abe", "sortedProp":"1de9"},
 {"name":"bit", "sortedProp":"e182"}...]

I want to find the closest value of sortedProp to a given value (say for example 1dff should return the second dict here). I know I can use bisect for this for optimal speed (performance is important, since the list is ~30,000 dicts long), and I've found answers for finding exact values in a list of dicts, and finding closest value in a list of ints, but I can't seem to find an answer for finding the closest value in a list of dicts.

edit: I'm not asking how to sort the list of dicts, I already have done that. I'm asking how to find the dict that contains the closest value of sortedProp to a given value.

edit2: I'm using hexadecimal numbers (results from a phash run) as the sorted property, so 'closest' is defined as the Hamming distance between the two hashes.

Community
  • 1
  • 1
IronWaffleMan
  • 2,513
  • 5
  • 30
  • 59
  • Possible duplicate of [Sort a Python dictionary by value](http://stackoverflow.com/questions/613183/sort-a-python-dictionary-by-value) – AlexLordThorsen Sep 19 '16 at 23:07
  • How are you defining "Closest?" That's critical information. Something like Levenshtein distance? Or do you consider `sortedProp`'s values to be hex numbers and counting it that way? – Adam Smith Sep 19 '16 at 23:13
  • `bisect` seems perfect, what was wrong with that? – wim Sep 19 '16 at 23:14
  • Use `bisect` on a list that contains just the `sortedProp` values or, since `sortedProp` seems to contain hexadecimal values you can add an hex to int conversion to http://stackoverflow.com/questions/12141150/from-list-of-integers-get-number-closest-to-a-given-value. – Eugen Constantin Dinca Sep 19 '16 at 23:16
  • 1
    @wim Nothing, I just don't know how to use it on a list of dicts. The only examples I've found where it's looking for a near value (not an exact match) use a list of ints. – IronWaffleMan Sep 19 '16 at 23:17
  • 1
    Are you checking multiple values or just one? – Padraic Cunningham Sep 19 '16 at 23:36
  • How is it sorted, and how do you know you can use `bisect` for this? I rather doubt it. – Stefan Pochmann Sep 19 '16 at 23:39
  • @StefanPochmann I am sorting it on the phash values (which are after all just really big hexadecimal numbers). I assumed I can use bisect on this since bisect works on a list of integers (http://stackoverflow.com/questions/12141150/from-list-of-integers-get-number-closest-to-a-given-value), so why not a list of dictionaries that contain integers? – IronWaffleMan Sep 19 '16 at 23:41
  • 1
    @IronWaffleMan, the hamming distance and hex values are two different things, `min(l, key=lambda d: abs(int(d["sortedProp"], 16) - int("somhex", 16)))` and `min(l, key=lambda d: distance.hamming(d["sortedProp"], "somehex"))` could give completely different results – Padraic Cunningham Sep 19 '16 at 23:43
  • @PadraicCunningham Fair point. When I load the pickle file containing these hexes I convert them back to hashes (I'm using `ImageHash` library for this), and the test hash value I'm trying to find is also of the same type, so doing `testHash - fileHash` is effectively calculating the hamming distance. I assume though `bisect` won't make that calculation properly? – IronWaffleMan Sep 19 '16 at 23:46
  • *"`testHash - fileHash` is effectively calculating the hamming distance"*? How so? What data type are `testHash` and `fileHash`? – Stefan Pochmann Sep 19 '16 at 23:49
  • @StefanPochmann I believe it's some `ImageHash` type. Here's the function that converts it from a hex to a hash: https://github.com/JohannesBuchner/imagehash/blob/master/imagehash/__init__.py#L86 – IronWaffleMan Sep 19 '16 at 23:50
  • 1
    @IronWaffleMan Ah, something special then. Anyway, I don't see how you can meaningfully sort them for Hamming distance except by sorting them for a specific target, which is of course pointless because then you could just search for the closest to that target right away. – Stefan Pochmann Sep 19 '16 at 23:58
  • @IronWaffleMan, are you doing this multiple times after sorting or just for one value? – Padraic Cunningham Sep 20 '16 at 00:01
  • @PadraicCunningham Multiple times. The list of dicts is sorted once and then I have multiple test hashes. – IronWaffleMan Sep 20 '16 at 00:02
  • If you are going to be calculating the closest using the hamming distance then bisect is definitely not going to work. Sorting is irrelevant for a hamming distance check. It might be worth asking the guys that wrote the lib for their input to see exactly what is going on with the hashing. – Padraic Cunningham Sep 20 '16 at 00:09
  • Are the hashes actually just four hex digits long, or did you shorten them just for showing us? – Stefan Pochmann Sep 20 '16 at 00:09
  • @StefanPochmann They're actually 48 digits long, I shortened them for brevity's sake. PadraicCunningham, I'm starting to realise that the sorting is useless, yeah. I thought I could treat the hashes as numbers and run with that, but clearly I was mistaken. More research is required. – IronWaffleMan Sep 20 '16 at 00:13
  • I'd actually show real hashes then, otherwise it's misleading. With four digits, you could easily just do a BFS. But not with 48. – Stefan Pochmann Sep 20 '16 at 00:16
  • I think what you really need is some tree structure, you may find something useful here http://stackoverflow.com/questions/6389841/efficiently-find-binary-strings-with-low-hamming-distance-in-large-set. – Padraic Cunningham Sep 20 '16 at 00:21
  • OK, I'll make another question specifically asking about hamming distance with fixed 48-digit hexadecimals, this question doesn't really apply anymore. – IronWaffleMan Sep 20 '16 at 02:50

1 Answers1

2
def takeClosestEx(myList, myNumber):
    """
    Assumes myList is not sorted. Returns closest value to myNumber.
    """

    return min(myList, key=lambda x:abs(int(x["sortedProp"], 16)-myNumber))
Alex M
  • 2,756
  • 7
  • 29
  • 35
AnkMat
  • 21
  • 1