1

Does anyone know if there is a good way to create a map from string to string that has approximate string keys? That is, if I do the following:

map.put("Fuzzy", "string")
map.put("Fuzy", "bear")

I want the resulting map to be:

[ "Fuzzy":{ "string", "bear" } ]

(There might also be something in there to note that "bear" came from "Fuzy", but that is a secondary concern). Of course, the amount of approximation (distance) between strings would probably be a parameter. In this case, the distance is 1, but it could be more or less.

As far as I can tell, a Trie might be a good place to start, but I didn't want to implement something and find it has already been done.

Of course, the naive solution is just to loop over all the keys in the map, but I'm hoping for better efficiency than that.

Thanks!

mayhewsw
  • 704
  • 9
  • 20

3 Answers3

1

Here's a FuzzyHashMap implementation, although I haven't tried it:

http://sourceforge.net/projects/fuzzyhashmap/

as well as an implementation of a BK-Tree which looks related:

https://code.google.com/p/java-bk-tree/

John Lehmann
  • 7,975
  • 4
  • 58
  • 71
  • The problem with the FuzzyHashMap is that if the "fuzziness" is in the first four characters, it won't work. It also has a Soundex functionality, but this has the same problem. The BK-tree looks interesting though. – mayhewsw Nov 02 '13 at 20:06
1

I would suggest to implement hashCode and equals function so that they return the Soundex of the objects you are storing in the map.

Then you should be able to look up words quite quickly.

UPDATE: I just noticed, that it looks like we are talking python: so you have to override the __hash__ function AFAIK (There is also a good post on how to implement hashmaps in python)

Community
  • 1
  • 1
devsnd
  • 7,382
  • 3
  • 42
  • 50
  • I hadn't heard of Soundex before - that's a good idea! I'll try that. I'm actually using Java, I just like the Python output of maps. – mayhewsw Mar 20 '12 at 12:40
0

I has a similar requirement, so I implemented my own HashMap.

In my requirement, the keys were exact while inserting, but errors were possible in the search string.

My hash function:

first part of hashcode stores the length of the key second part of hashcode stores the sum of all the characters of the key

the bit widths of these two parts is fixed. So we get one bucket each for a given key length. First bucket stores keys of length 1, Second bucket stores keys of length 2, and so on

Now when find() is called, 1. it checks for exact match. if found, return. else, goto next step. 2. 3 possibilities of errors exist: a distorted char, a missing char, an additional char 3. check for distorted char. distortions don't change length, so we need to search the same bucket. if one char is distorted, the hash value can increase or decrease by MAX_CHAR_CODE at max. So from the location of expected hashcode, search MAX_CHAR_CODE indices backwards and forwards. Most values will be NULL. When a non-NULL value is found, compare the keys while allowing for distortion of one char. 4. check for a missing char. if a char is missing, the new key length will be in smaller than the actual one. So we need to search in the next bucket. The sum part of hashcode would have decreased by at max MAX_CHAR_CODE. So search from the current position in the next bucket, MAX_CHAR_CODE places forwards. When a non-NULL value is found, compare the keys while allowing for one missing char. 5. Additional char. Very similar to 4.