First of all, you need to sort the Humans into buckets by monthOfBirth
+ country
. That should be pretty cheap to do - just iterate through them all, popping each one into the appropriate bucket.
Note that appending Strings is the "hacky" way to approach this. The "proper" way would be to create a key object with a proper hashCode method:
public class MonthCountryKey {
String monthOfBirth;
String country;
// <snip> constructor, setters
@Override public int hashCode() {
return Arrays.hashCode(new Object[] {
monthOfBirth,
country,
});
}
@Override public boolean equals(Object o) {
...
}
}
See: What is a best practice of writing hash function in java?
Map<MonthCountryKey,List<Human>> buckets = new HashMap<List<Human>>;
while(Human human = humanSource.get()) {
MonthCountryKey key = new MonthCountryKey(human.getMonthOfBirth(), human.getCountry());
List list = buckets.get(key);
if(list == null) {
list = new ArrayList<Human>();
buckets.put(key,list);
}
list.add(human);
}
Note that there are other kinds of Set. For example, new TreeSet(monthCountryHumanComparator)
-- with Apache BeanUtils new TreeSet(new BeanComparator("monthOfBirth.country"))
!
If there are really lots of humans, it could be worth storing the buckets in a database - SQL or otherwise, as you see fit. You just need to be able to get them reasonably quickly by bucket and list index number.
Then you can apply a hobby matching algorithm to each bucket in turn, reducing the scale of the brute force search dramatically.
I can't see a way to avoid comparing every Human in a bucket to every other Human in the same bucket, but you could do some work to make the comparisons cheap.
Consider encoding the hobbies as an integer; one bit per hobby. A long gives you up to 64 hobbies. If you need more, you will need more integers or a BigInteger (benchmark both approaches). You could build up the dictionary of bit positions to hobbies as you work through the Humans and encounter new hobbies. Comparing two sets of hobbies is then a cheap binary '&' followed by a Long.bitCount().
To illustrate, the first human has hobbies [ "cooking", "cinema" ]
So the right hand bit is "cooking", the next bit to the left is "cinema" and this human's encoded hobbies are binary {60 zeroes}00011 == 3
Next human likes [ "cooking", "fishing" ]
So fishing
gets added to the dictionary and this human's encoded hobbies are {60 zeroes}0101 = 5
public long encodeHobbies(List<String> hobbies, BitPositionDictionary dict) {
long encoded = 0;
for(String hobby : hobbies) {
int pos = dict.getPosition(hobby); // if not found, allocates new
encoded &= (1 << pos)
}
return encoded;
}
... with...
public class BitPositionDictionary {
private Map<String,Integer> positions = new HashMap<String,Integer>();
private int nextPosition;
public int getPosition(String s) {
Integer i = positions.get(s);
if(i == null) {
i = nextPosition;
positions.put(i,s);
nextPosition++;
}
return i;
}
}
Binary & them to get {60 zeroes}0001; Long.bitCount(1) == 1. These two humans have one hobby in common.
To handle your third human: [ "fishing", "clubbing", "chess" ], your costs are:
- Adding to the hobby->bit position dictionary and encoding to integer(s)
- Comparing against all the binary-encoded hobby strings created thus far
You'll want to store your binary encoded hobbies somewhere really cheap to access. I'd be tempted to just use an array of long, with a corresponding index of humans:
long[] hobbies = new long[numHumans];
int size = 0;
for(int i = 0; i<numHumans; i++) {
hobby = encodeHobbies(humans.get(i).getHobbies(),
bitPositionDictionary);
for(int j = 0; j<size; j++) {
if(enoughBitsInCommon(hobbies[j], hobby)) {
// just record somewhere cheap for later processing
handleMatch(i,j);
}
}
hobbies[size++] = hobby;
}
With...
// Clearly this could be extended to encodings of more than one long
static boolean enoughBitsInCommon(long x, long y) {
int numHobbiesX = Long.bitCount(x);
int hobbiesInCommon = Long.bitCount(x & y);
// used 128 in the hope that compiler will optimise!
return ((hobbiesInCommon * 128) / numHobbiesX ) > MATCH_THRESHOLD;
}
This way, if there are few enough hobby types to keep in a long, you can keep 168 million sets of hobbies in a 1GB array :)
It should be blisteringly fast; I reckon RAM access time is the bottleneck here. But it is a brute force search, and continues to be O(n2)
If you're talking about really huge datasets, I suspect this approach would be amenable to distributed processing with MapReduce or whatever.
Additional notes: you could use a BitSet instead of long(s), and get a bit more expressiveness; perhaps at the cost of some performance. Again, benchmark.
long x,y;
...
int numMatches = Long.bitCount(x & y);
... becomes
BitSet x,y;
...
int numMatches = x.and(y).cardinality();
The number of positions at which two strings differ is called the Hamming distance, and there is an answered question on cstheory.so about searching for pairs with a close Hamming distance: https://cstheory.stackexchange.com/questions/18516/find-all-pairs-of-values-that-are-close-under-hamming-distance -- from what I understand of the accepted answer, it's an approach which will find a "very high proportion" of matches, not all, which I guess does require a brute-force search.