
I need an algorithm to find matching pairs of objects in a list. This is an example case:

class Human 
{
   int ID;
   string monthOfBirth;
   string country;
   string [] hobbies = {};
}

There is a large list of humans, and the problem is to find matching human pairs; this needs to be done efficiently because the lists are huge.

Matching criteria:

  1. Birth month and country must match exactly
  2. Both should have more than x% of hobbies matching.

Because of criterion (2), we can't do an exact equals comparison.

The ways I can think of are:

  1. Brute force - Compare each object with every other object. Complexity O(n^2)
  2. Hash table

For the hash table approach, I'm considering the following way:

  1. Create a hash map of <String, List<Human>> (or a MultiMap)
  2. Concatenate the birth month and country of each Human into one string
  3. Use this concatenated string as the key into the map (two humans having the same birth month and country must give the same hash code)
  4. If there is already an entry for the key, compare the hobbies for an x% match (see the sketch after this list)
  5. If they match, then this is a duplicate
  6. If the hobbies do not match more than x%, then add this human to the list (linked list approach)
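
For step 4, the hobby comparison I have in mind would be roughly like the sketch below (`x` is the match threshold in percent; measuring the overlap against the smaller hobby list is just one possible interpretation):

 static boolean hobbiesMatch(Human a, Human b, int x) {
     // Count the hobbies the two humans have in common
     Set<String> hobbiesOfA = new HashSet<String>(Arrays.asList(a.hobbies));
     int common = 0;
     for (String hobby : b.hobbies) {
         if (hobbiesOfA.contains(hobby)) common++;
     }
     // More than x% of the smaller hobby list must match
     int smaller = Math.min(a.hobbies.length, b.hobbies.length);
     return smaller > 0 && common * 100 > x * smaller;
 }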

Is there a better way to do this?

Does it make sense to concatenate the month and country? The list would be large, so I'm assuming 'better' means less storage rather than faster execution.

madu
    How big does the hobbies list generally get? Is it variable length and if so does it have a max length? – QuestionC Mar 11 '14 at 14:47
  • Is the list of countries and hobbies known ahead of time? You could use that to reduce the memory usage, and perhaps use better data structures (hint: `HashMap` isn't the only implementation of the `Map` interface). – SimonC Mar 11 '14 at 14:48
  • Thank you for the replies. There are no exact limits, but it should be assumed that it can be large. Hobbies is only an example; it can be another list of strings which can be quite large. But it is possible to set a max length. How would that change the algorithm? – madu Mar 11 '14 at 14:57
  • The countries and hobbies are NOT known in advance. But if they are, how could that be used to reduce the memory usage? – madu Mar 11 '14 at 14:58
  • For example, you could use a tree structure to store your `Human`s. The first level could be an array of length 12, where each slot corresponds to a birth month, `0` for January, `1` for February, and so on. This way you're not storing the string representation of the birth month in your data structure. The next level of the tree could then be country, and so on. – SimonC Mar 11 '14 at 15:08
  • There are only 12 values for monthOfBirth, so have you considered an array of hashes? – pjs Mar 11 '14 at 16:45
  • @pjs I wasn't sure whether "monthOfBirth" might be shorthand for "date of birth to the nearest month". Otherwise, it should be an enum. – slim Mar 13 '14 at 11:48
  • @slim Then use a hash of hashes. – pjs Mar 13 '14 at 13:46

3 Answers


First of all, you need to sort the Humans into buckets by monthOfBirth + country. That should be pretty cheap to do - just iterate through them all, popping each one into the appropriate bucket.

Note that appending Strings is the "hacky" way to approach this. The "proper" way would be to create a key object with a proper hashCode method:

 public class MonthCountryKey {
     String monthOfBirth;
     String country;
     // <snip> constructor, setters 
     @Override public int hashCode() {
         return Arrays.hashCode(new Object[] {
            monthOfBirth, 
            country,
         });
     }
     @Override public boolean equals(Object o) {
         if (!(o instanceof MonthCountryKey)) return false;
         MonthCountryKey other = (MonthCountryKey) o;
         return monthOfBirth.equals(other.monthOfBirth)
                 && country.equals(other.country);
     }
 }

See: What is a best practice of writing hash function in java?

Map<MonthCountryKey,List<Human>> buckets = new HashMap<MonthCountryKey,List<Human>>();

Human human;
while((human = humanSource.get()) != null) {
    MonthCountryKey key = new MonthCountryKey(human.getMonthOfBirth(), human.getCountry());
    List<Human> list = buckets.get(key);
    if(list == null) {
       list = new ArrayList<Human>();
       buckets.put(key,list);
    }
    list.add(human);
}

Note that there are other kinds of Set. For example, new TreeSet(monthCountryHumanComparator) -- with Apache BeanUtils new TreeSet(new BeanComparator("monthOfBirth.country"))!
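
For illustration, a hand-rolled comparator for such a TreeSet could look like the sketch below (the getter names are assumptions):

 Comparator<Human> monthCountryHumanComparator = new Comparator<Human>() {
     @Override public int compare(Human a, Human b) {
         // Order by birth month first, then by country
         int c = a.getMonthOfBirth().compareTo(b.getMonthOfBirth());
         return c != 0 ? c : a.getCountry().compareTo(b.getCountry());
     }
 };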

If there are really lots of humans, it could be worth storing the buckets in a database - SQL or otherwise, as you see fit. You just need to be able to get them reasonably quickly by bucket and list index number.

Then you can apply a hobby matching algorithm to each bucket in turn, reducing the scale of the brute force search dramatically.

I can't see a way to avoid comparing every Human in a bucket to every other Human in the same bucket, but you could do some work to make the comparisons cheap.

Consider encoding the hobbies as an integer; one bit per hobby. A long gives you up to 64 hobbies. If you need more, you will need more integers or a BigInteger (benchmark both approaches). You could build up the dictionary of bit positions to hobbies as you work through the Humans and encounter new hobbies. Comparing two sets of hobbies is then a cheap binary '&' followed by a Long.bitCount().

To illustrate, the first human has hobbies [ "cooking", "cinema" ]

So the right-hand bit is "cooking", the next bit to the left is "cinema", and this human's encoded hobbies are binary {59 zeroes}00011 == 3

Next human likes [ "cooking", "fishing" ]

So fishing gets added to the dictionary and this human's encoded hobbies are {60 zeroes}0101 = 5

 public long encodeHobbies(List<String> hobbies, BitPositionDictionary dict) {
      long encoded = 0;
      for(String hobby : hobbies) {
          int pos = dict.getPosition(hobby); // if not found, allocates new
          encoded |= (1L << pos);
      }
      return encoded;
 }

... with...

 public class BitPositionDictionary {
     private Map<String,Integer> positions = new HashMap<String,Integer>();
     private int nextPosition;
     public int getPosition(String s) {
         Integer i = positions.get(s);
         if(i == null) {
             i = nextPosition;
             positions.put(s, i);
             nextPosition++;
         }
         return i;
     }
 }

Binary-AND them together to get {60 zeroes}0001; Long.bitCount(1) == 1. These two humans have one hobby in common.

To handle your third human: [ "fishing", "clubbing", "chess" ], your costs are:

  • Adding to the hobby->bit position dictionary and encoding to integer(s)
  • Comparing against all the binary-encoded hobby strings created thus far

You'll want to store your binary encoded hobbies somewhere really cheap to access. I'd be tempted to just use an array of long, with a corresponding index of humans:

  long[] hobbies = new long[numHumans];
  int size = 0;
  for(int i = 0; i<numHumans; i++) {
      long hobby = encodeHobbies(humans.get(i).getHobbies(),
                                 bitPositionDictionary);
      for(int j = 0; j<size; j++) {
          if(enoughBitsInCommon(hobbies[j], hobby)) {
              // just record somewhere cheap for later processing
              handleMatch(i,j); 
          }
      }
      hobbies[size++] = hobby;
  }

With...

  // Clearly this could be extended to encodings of more than one long.
  // MATCH_THRESHOLD is the x% match threshold scaled to 0..128;
  // assumes x encodes at least one hobby.
  static boolean enoughBitsInCommon(long x, long y) {
      int numHobbiesX = Long.bitCount(x);
      int hobbiesInCommon = Long.bitCount(x & y);
      // used 128 in the hope that the compiler will optimise the division!
      return ((hobbiesInCommon * 128) / numHobbiesX ) > MATCH_THRESHOLD;
  }

This way, if there are few enough hobby types to fit in a long, you can keep about 134 million sets of hobbies in a 1GB array :)

It should be blisteringly fast; I reckon RAM access time is the bottleneck here. But it is still a brute force search, and continues to be O(n²) within each bucket.

If you're talking about really huge datasets, I suspect this approach would be amenable to distributed processing with MapReduce or whatever.


Additional notes: you could use a BitSet instead of long(s), and get a bit more expressiveness; perhaps at the cost of some performance. Again, benchmark.

  long x,y;
  ...
  int numMatches = Long.bitCount(x & y);

  ... becomes

  BitSet x,y;
  ...
  // BitSet.and() mutates in place and returns void, so work on a copy
  BitSet tmp = (BitSet) x.clone();
  tmp.and(y);
  int numMatches = tmp.cardinality();

The number of positions at which two strings differ is called the Hamming distance, and there is an answered question on cstheory.so about searching for pairs with a close Hamming distance: https://cstheory.stackexchange.com/questions/18516/find-all-pairs-of-values-that-are-close-under-hamming-distance -- from what I understand of the accepted answer, it's an approach which will find a "very high proportion" of matches, not all, which I guess does require a brute-force search.
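
With the long-encoded hobbies above, the Hamming distance between two humans' hobby sets is just an XOR followed by a bit count; a minimal sketch:

 // Number of hobbies that exactly one of the two humans has
 static int hammingDistance(long x, long y) {
     return Long.bitCount(x ^ y);
 }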

slim
  • +1 for the concept. Actually I was thinking the same, with use of a BitSet for encodeHobbies. Then I thought of suggesting keeping an enum/static constant array for country and month; instead of holding a String as part of Human, he can hold an integer representing the combination of the month and country indices. – Mani Mar 11 '14 at 21:00
  • I tried the `BitCount(x & y) > n` brute force search on random arrays of `long`, on a 2.5GHz Intel CPU. It took (array size, time): (2^16, 2215ms), (2^17, 9002ms), (2^18, 37349ms). So yes it grows quadratically, but it's fast for quite big sets. For random input and low values of `n`, storing the matches became expensive. I guess real-world hobby data would have fewer matches than random longs. – slim Mar 13 '14 at 13:02
  • Just a note: BitSet uses a long[] to store the bit information internally. – Mani Mar 13 '14 at 17:33

Hash would generally be the way to go. Rather than concatenating month and country, you could cheat and just add the hashcodes of these two values together to form a combined hashcode; that'd save you some processing effort and memory use. You could also define .equals() for the record to implement the match logic you've described, which would let a hashset directly check whether a matching entry exists.
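
For illustration, a sketch of such a combined hash code (the 31 multiplier, recommended in the comments below, spreads values a little better than plain addition; collisions are fine because .equals() has the final say):

 @Override public int hashCode() {
     // Mix the two field hashes; the hash table resolves collisions via equals()
     return 31 * monthOfBirth.hashCode() + country.hashCode();
 }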

keshlam
  • But if I add the two hash codes of month and country together, isn't there a possibility that two different combinations of month and country give the same hash? So that two different humans have same hash code? – madu Mar 11 '14 at 15:00
  • Hash codes are not unique, and don't have to be as long as they collide _rarely_ -- which will be true in this case. Hashtables know how to deal with this as long as .equals() is also implemented properly. – keshlam Mar 11 '14 at 17:07
  • It has been recommended that you use `31 * hashcode(string1) + hashcode(string2)` -- however it's not the end of the world if there's a hash collision. Algorithms that use the hashcode always then use equals() to double check. – slim Mar 11 '14 at 17:09

This solution assumes that you can already write a brute force matching method. There is room for optimization, but in general this is the right algorithm.

void FindMatches (std::vector<Human> const & input, std::back_insert_iterator<std::vector<Human>> result)
{
  typedef std::pair<std::string, std::string> key_type;
  typedef std::vector<Human> Human_collection;

  typedef std::map<key_type, Human_collection> map_type;

  map_type my_map;

  for (auto ci = input.begin(); ci != input.end(); ++ci)
  {
    key_type my_key(ci->monthOfBirth, ci->country);

    my_map[my_key].push_back(*ci);
  }

  // Each value of my_map is now a collection of humans sharing the same birth statistics, which is the key.
  for (auto ci = my_map.begin(); ci != my_map.end(); ++ci)
  {
    FindMatches_BruteForce(ci->second, result);
  }

  return;
}

There's a lot of possible room for efficiency here: you could copy around pointers instead of full objects, use some other data structure than a map, or just sort the input container in-place. But algorithmically, I believe this is as good as it gets.
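
To illustrate the sort-in-place variant, a sketch in Java reusing the question's classes (a month-then-country comparator like the one in slim's answer; findMatchesBruteForce and result are hypothetical stand-ins):

  Collections.sort(humans, monthCountryHumanComparator);
  int start = 0;
  for (int i = 1; i <= humans.size(); i++) {
      // A bucket ends where the month+country key changes (or at the end)
      if (i == humans.size()
              || monthCountryHumanComparator.compare(humans.get(start), humans.get(i)) != 0) {
          findMatchesBruteForce(humans.subList(start, i), result);
          start = i;
      }
  }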

QuestionC