I have an algorithmic problem at hand. To easily explain the problem, I will be using a simple analogy. I have an input file
Country,Exports
Austrailia,Sheep
US, Apple
Austrialia,Beef
End Goal: I have to find the common products between the pairs of countries so
{"Austrailia,New Zealand"}:{"apple","sheep}
{"Austrialia,US"}:{"apple"}
{"New Zealand","US"}:{"apple","milk"}
Process :
I read in the input and store it in a TreeMap > Where the List, the strings are interned due to many duplicates. Essentially, I am aggregating by country. where Key is country, Values are its Exports.
{"austrailia":{"apple","sheep","koalas"}}
{"new zealand":{"apple","sheep","milk"}}
{"US":{"apple","beef","milk"}}
I have about 1200 keys (countries) and total number of values(exports) is 80 million altogether. I sort all the values of each key:
{"austrailia":{"apple","sheep","koalas"}} -- > {"austrailia":{"apple","koalas","sheep"}}
This is fast as there are only 1200 Lists to sort.
for(k1:keys)
for(k2:keys)
if(k1.compareTo(k2) <0){ //Dont want to double compare
List<String> intersectList = intersectList_func(k1's exports,k2's exports);
countriespair.put({k1,k2},intersectList)
}
This code block takes so long.I realise it O(n2) and around 1200*1200 comparisions.Thus,Running for almost 3 hours till now.. Is there any way, I can speed it up or optimise it. Algorithm wise is best option, or are there other technologies to consider.
Edit: Since both List are sorted beforehand, the intersectList is O(n) where n is length of floor(listOne.length,listTwo.length) and NOT O(n2) as discussed below
private static List<String> intersectList(List<String> listOne,List<String> listTwo){
int i=0,j=0;
List<String> listResult = new LinkedList<String>();
while(i!=listOne.size() && j!=listTwo.size()){
int compareVal = listOne.get(i).compareTo(listTwo.get(j));
if(compareVal==0){
listResult.add(listOne.get(i));
i++;j++;} }
else if(compareVal < 0) i++;
else if (compareVal >0) j++;
}
return listResult;
}
Update 22 Nov My current implementation is still running for almost 18 hours. :|
Update 25 Nov I had run the new implementation as suggested by Vikram and a few others. It's been running this Friday. My question, is that how does grouping by exports rather than country save computational complexity. I find that the complexity is the same. As Groo mentioned, I find that the complexity for the second part is O(E*C^2) where is E is exports and C is country.