3

I have an algorithmic problem at hand. To explain it easily, I will use a simple analogy. I have an input file:

Country,Exports
Australia,Sheep
US,Apple
Australia,Beef

End Goal: I have to find the common products between pairs of countries, like so:

{"Austrailia,New Zealand"}:{"apple","sheep}
{"Austrialia,US"}:{"apple"}
{"New Zealand","US"}:{"apple","milk"}

Process:

I read in the input and store it in a TreeMap<String, List<String>>, where the strings in each List are interned because there are many duplicates. Essentially, I am aggregating by country: the key is the country, and the values are its exports.

{"austrailia":{"apple","sheep","koalas"}}
{"new zealand":{"apple","sheep","milk"}}
{"US":{"apple","beef","milk"}}

I have about 1200 keys (countries), and the total number of values (exports) across all keys is about 80 million. I sort all the values of each key:

{"austrailia":{"apple","sheep","koalas"}} -- > {"austrailia":{"apple","koalas","sheep"}}

This is fast as there are only 1200 Lists to sort.
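For illustration, here is a minimal sketch of this preprocessing step (reading, interning, aggregating by country, and sorting); the class, method, and file names are just illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class ExportAggregator {
    // Build {country -> sorted list of its exports} from the CSV input.
    static TreeMap<String, List<String>> readExports(String filename) throws IOException {
        TreeMap<String, List<String>> exportsByCountry = new TreeMap<String, List<String>>();
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String line = br.readLine();               // skip the "Country,Exports" header
        while ((line = br.readLine()) != null) {
            String[] s = line.split(",");
            String country = s[0].trim();
            String export = s[1].trim().intern();  // intern: millions of duplicate product names
            List<String> exports = exportsByCountry.get(country);
            if (exports == null) {
                exports = new ArrayList<String>();
                exportsByCountry.put(country, exports);
            }
            exports.add(export);
        }
        br.close();
        for (List<String> exports : exportsByCountry.values())
            Collections.sort(exports);             // cheap: only ~1200 lists to sort
        return exportsByCountry;
    }
}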

for (k1 : keys)
    for (k2 : keys)
        if (k1.compareTo(k2) < 0) {  // don't compare each pair twice
            List<String> intersectList = intersectList_func(k1's exports, k2's exports);
            countriesPair.put({k1, k2}, intersectList);
        }

This code block takes very long. I realise it is O(n^2), around 1200*1200 comparisons, and it has been running for almost 3 hours now. Is there any way I can speed it up or optimise it? Is an algorithmic improvement the best option, or are there other technologies to consider?

Edit: Since both lists are sorted beforehand, intersectList is linear: each iteration advances at least one index and the loop stops as soon as either list is exhausted, so it is O(listOne.length + listTwo.length) and NOT O(n^2) as discussed below.

private static List<String> intersectList(List<String> listOne, List<String> listTwo) {
    int i = 0, j = 0;
    // note: get(i) assumes random-access lists (e.g. ArrayList) for O(1) access
    List<String> listResult = new LinkedList<String>();
    while (i != listOne.size() && j != listTwo.size()) {
        int compareVal = listOne.get(i).compareTo(listTwo.get(j));
        if (compareVal == 0) {          // common element: keep it and advance both
            listResult.add(listOne.get(i));
            i++; j++;
        } else if (compareVal < 0) {    // listOne is behind: advance it
            i++;
        } else {                        // listTwo is behind: advance it
            j++;
        }
    }
    return listResult;
}

Update 22 Nov: My current implementation has now been running for almost 18 hours. :|

Update 25 Nov: I ran the new implementation as suggested by Vikram and a few others; it has been running since Friday. My question is: how does grouping by exports rather than by country reduce the computational complexity? I find that the complexity is the same. As Groo mentioned, the complexity of the second part is O(E*C^2), where E is the number of exports and C is the number of countries.

prog_guy

6 Answers

2

Store something like the following data structure (the snippet below is pseudocode):

ValuesSet ={
apple = {"Australia","New Zealand"..}
sheep = {"Australia","New Zealand"..}

}

for k in ValuesSet
    for k1 in k.values()
        for k2 in k.values()
            if (k1 < k2)
                Set(k1, k2).add(k)

Time complexity: O(number of distinct pairs with similar products)

Note: I might be wrong, but I do not think you can reduce this time complexity.

The following is a Java implementation for your problem:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PairMatching {

    HashMap<String, Integer> Country;          // country name -> country index
    ArrayList<String> CountNames;              // country index -> country name
    HashMap<String, Integer> ProdtoIndex;      // product name -> product index
    ArrayList<ArrayList<Integer>> ProdtoCount; // product index -> indices of exporting countries
    ArrayList<String> ProdNames;               // product index -> product name
    ArrayList<Integer>[][] Pairs;              // Pairs[i][j] -> products common to countries i and j

    int products = 0;
    int countries = 0;

    public void readfile(String filename) {
        try {
            BufferedReader br = new BufferedReader(new FileReader(filename));
            String line;
            CountNames = new ArrayList<String>();
            Country = new HashMap<String, Integer>();
            ProdtoIndex = new HashMap<String, Integer>();
            ProdtoCount = new ArrayList<ArrayList<Integer>>();
            ProdNames = new ArrayList<String>();
            products = countries = 0;
            while ((line = br.readLine()) != null) {
                String[] s = line.split(",");
                s[0] = s[0].trim();
                s[1] = s[1].trim();
                int k;
                if (!Country.containsKey(s[0])) {     // first time we see this country
                    CountNames.add(s[0]);
                    Country.put(s[0], countries);
                    k = countries;
                    countries++;
                } else {
                    k = Country.get(s[0]);
                }
                if (!ProdtoIndex.containsKey(s[1])) { // first time we see this product
                    ProdNames.add(s[1]);
                    ArrayList<Integer> n = new ArrayList<Integer>();
                    ProdtoIndex.put(s[1], products);
                    n.add(k);
                    ProdtoCount.add(n);
                    products++;
                } else {
                    int ind = ProdtoIndex.get(s[1]);
                    ProdtoCount.get(ind).add(k);
                }
            }
            br.close();
            System.out.println(CountNames);
            System.out.println(ProdtoCount);
            System.out.println(ProdNames);
        } catch (FileNotFoundException ex) {
            Logger.getLogger(PairMatching.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(PairMatching.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    @SuppressWarnings("unchecked")
    void FindPairs() {
        Pairs = new ArrayList[countries][countries];
        for (int i = 0; i < ProdNames.size(); i++) {
            ArrayList<Integer> curr = ProdtoCount.get(i);
            // every pair of countries sharing product i gets i appended to its list
            for (int j = 0; j < curr.size(); j++) {
                for (int k = j + 1; k < curr.size(); k++) {
                    int u = curr.get(j);
                    int v = curr.get(k);
                    if (u == v) continue;             // same country exported product i twice
                    if (Pairs[u][v] == null) {
                        if (Pairs[v][u] != null)
                            Pairs[v][u].add(i);
                        else {
                            Pairs[u][v] = new ArrayList<Integer>();
                            Pairs[u][v].add(i);
                        }
                    } else {
                        Pairs[u][v].add(i);
                    }
                }
            }
        }
        for (int i = 0; i < countries; i++) {
            for (int j = 0; j < countries; j++) {
                if (Pairs[i][j] == null)
                    continue;
                ArrayList<Integer> a = Pairs[i][j];
                System.out.print("\n{" + CountNames.get(i) + "," + CountNames.get(j) + "} : ");
                for (int k = 0; k < a.size(); k++) {
                    System.out.print(ProdNames.get(a.get(k)) + " ");
                }
            }
        }
    }

    public static void main(String[] args) {
        PairMatching pm = new PairMatching();
        pm.readfile("Input data/BigData.txt");
        pm.FindPairs();
    }
}
Vikram Bhat
  • 1
    The for loop is quite misleading: for k1 in k should be for k1 in k.values(), and likewise for k2. – prog_guy Nov 22 '13 at 08:58
2

This can be done in one statement as a self-join using SQL:

Test data. First create a test data set:

Lines <- "Country,Exports
Australia,Sheep
Australia,Apple
New Zealand,Apple
New Zealand,Sheep
New Zealand,Milk
US,Apple
US,Milk
"
DF <- read.csv(text = Lines, as.is = TRUE)

sqldf. Now that we have DF, issue this command:

library(sqldf)
sqldf("select a.Country, b.Country, group_concat(Exports) Exports
   from DF a, DF b using (Exports) 
   where a.Country < b.Country
   group by a.Country, b.Country
")

giving this output:

      Country     Country     Exports
1   Australia New Zealand Sheep,Apple
2   Australia          US       Apple
3 New Zealand          US  Apple,Milk

With an index. If it's too slow, add an index to the Country column (and be sure not to forget the main. parts):

sqldf(c("create index idx on DF(Country)",
   "select a.Country, b.Country, group_concat(Exports) Exports
   from main.DF a, main.DF b using (Exports) 
   where a.Country < b.Country
   group by a.Country, b.Country
"))

If you run out of memory, then add the dbname = tempfile() argument to sqldf so that it uses disk.

G. Grothendieck
  • Nice solution. Not really a DB guy, so a few questions: 1) I remember that using (col_name) joins only work when both columns are equal, so how did this compute the intersection? 2) What is the last group by line for? When I removed it, only one row was returned; I have so far used group by for aggregating results, but this somehow does the opposite. I haven't yet benchmarked it with the entire dataset. – prog_guy Nov 22 '13 at 03:15
  • 1
    (1) Right, we are joining by `Exports` and those must be equal in the surviving country pairs. (2) We are performing the `group_concat` aggregation over the groups defined by `group by`. – G. Grothendieck Nov 22 '13 at 04:19
1

[Update] The algorithm presented here shouldn't improve time complexity compared to the OP's original algorithm. Both algorithms have the same asymptotic complexity, and iterating through sorted lists (as OP does) should generally perform better than using a hash table.

You need to group the items by product, not by country, in order to be able to quickly fetch all countries belonging to a certain product.

This would be the pseudocode:

inputList contains a list of pairs {country, product}

// group by product 
prepare mapA (product) => (list_of_countries)
for each {country, product} in inputList
{      
   if mapA does not contain (product)
      create a new empty (list_of_countries) 
      and add it to mapA with (product) as key

   add this (country) to the (list_of_countries)
}

// now group by country_pair  
prepare mapB (country_pair) => (list_of_products)       
for each {product, list_of_countries} in mapA
{   
   for each pair {countryA, countryB} in list_of_countries
   {
      if mapB does not contain country_pair {countryA, countryB}
         create a new empty (list_of_products) 
         and add it to mapB with country_pair {countryA, countryB} as key

      add this (product) to the (list_of_products)
   }
}

If your input list is length N, and you have C distinct countries and P distinct products, then the running time of this algorithm should be O(N) for the first part and O(P*C^2) for the second part. Since your final list needs to have pairs of countries mapping to lists of products, I don't think you will be able to lose the P*C^2 complexity in any case.

I don't code in Java too much, so I added a C# example which I believe you'll be able to port pretty easily:

// mapA maps each product to a list of countries
var mapA = new Dictionary<string, List<string>>();
foreach (var t in inputList)
{
    List<string> countries = null;
    if (!mapA.TryGetValue(t.Product, out countries))
    {
        countries = new List<string>();
        mapA[t.Product] = countries;
    }
    countries.Add(t.Country);
}

// note (this is very important):
// CountryPair tuple must have value-type comparison semantics, 
// i.e. you need to ensure that two CountryPairs are compared
// by value to allow hashing (mapping) to work correctly, in O(1).

// In C# you can also simply use a Tuple<string,string> to 
// represent a pair of countries (which implements this correctly),
// but I used a custom class to emphasize the algorithm

// mapB maps each CountryPair to a list of products
var mapB = new Dictionary<CountryPair, List<string>>();
foreach (var kvp in mapA)
{
    var product = kvp.Key;
    var countries = kvp.Value;

    for (int i = 0; i < countries.Count; i++)
    {
        for (int j = i + 1; j < countries.Count; j++)
        {
            var pair = CountryPair.Create(countries[i], countries[j]);
            List<string> productsForCountryPair = null;
            if (!mapB.TryGetValue(pair, out productsForCountryPair))
            {
                productsForCountryPair = new List<string>();
                mapB[pair] = productsForCountryPair;
            }
            productsForCountryPair.Add(product);
        }
    }
}
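If you port this to Java, the CountryPair type needs the value semantics described in the comments above; here is a minimal sketch (the class name comes from the C# example, the implementation is my own assumption):

import java.util.Objects;

// Hypothetical Java counterpart of the CountryPair used above: value-based
// equals/hashCode so it can serve as a HashMap key with O(1) lookup.
final class CountryPair {
    final String first, second;

    CountryPair(String a, String b) {
        // normalize the order so {A,B} and {B,A} become the same key
        if (a.compareTo(b) <= 0) { first = a; second = b; }
        else { first = b; second = a; }
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CountryPair)) return false;
        CountryPair p = (CountryPair) o;
        return first.equals(p.first) && second.equals(p.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }
}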
vgru
  • I ran this new implementation where I group by exports rather than by country. It has been running since Friday evening. My question is: how does this new implementation reduce the computational complexity? I find that the complexity is almost the same; the only difference is how the intersection between the lists is computed. Correct me if I am wrong. – prog_guy Nov 25 '13 at 08:29
  • The most important part of this algorithm is to implement the `CountryPair` equality/hashing functionality properly. This means you will have to [override `equals`/`getHashCode`](http://stackoverflow.com/questions/27581/overriding-equals-and-hashcode-in-java) and ensure that your hash function does not provide too many collisions. You can also use any tuple class which implements equality this way (e.g. [javatuples](http://www.javatuples.org/)). If your hash function is written poorly, then you will lose `O(1)` lookup for the country pair map, and time will degrade. – vgru Nov 25 '13 at 13:29
  • @prog_guy: I have also edited my answer with some complexity analysis, so you can check whether my numbers are correct. – vgru Nov 25 '13 at 14:31
  • 1
    *You have around 65,000 distinct products (80M / 1.2k = 66.6k)*: I have around 3.7 million distinct products, not 66K. I still don't agree that my intersect method is O(n^2); I still think it is O(n). For the CountryPair class, which has 2 strings, I have used the generic Objects.hash(country1, country2) for now. I will also have a look at javatuples. – prog_guy Nov 26 '13 at 03:03
  • @prog_guy: yes, you're correct, intersect iterates both lists until the end in the worst case, which is `O(2n) = O(n)`. This indeed makes both complexities equal (unless you account for the sorting part, `O(n log n)`, which is quick to execute), I am sorry I didn't analyze your code better the first time. Your code might even exhibit better locality of reference, but the allocations are something that you could reduce by using an `ArrayList` instead of a `LinkedList`. – vgru Nov 26 '13 at 08:32
0

This is a great example for using MapReduce.

  • In the map phase you simply collect all the exports that belong to each country.
  • Then the reducer sorts the products (all products of one country arrive at the same reducer, because of the mapper).

You will benefit from a distributed, parallel algorithm that can run across a cluster.
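To illustrate only (and with the caveats raised in the comments below), here is a minimal Hadoop sketch of these two phases; the class names are mine, the job wiring is omitted, and a real job would also skip the CSV header line:

import java.io.IOException;
import java.util.TreeSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (country, export) for every input line.
class ExportMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] s = line.toString().split(",");
        ctx.write(new Text(s[0].trim()), new Text(s[1].trim()));
    }
}

// Reduce phase: all exports of one country arrive at the same reducer,
// so they can be sorted (and de-duplicated) together.
class ExportReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text country, Iterable<Text> exports, Context ctx)
            throws IOException, InterruptedException {
        TreeSet<String> sorted = new TreeSet<String>();
        for (Text e : exports) sorted.add(e.toString());
        ctx.write(country, new Text(sorted.toString()));
    }
}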

istovatis
  • If he only has 1200 keys for 80 million records, MapReduce is the worst choice for this problem (well, actually it depends on the grouping algorithm and implementation used). Also, 80 million records is not so big that you would need a distributed system. – Thomas Jungblut Nov 21 '13 at 09:28
  • The large number of records relative to the number of keys is indeed a problem for MapReduce. The performance of the program will improve with a proper design of the grouping. Moreover, the hardware (i.e. the number of cores) is a crucial factor. But overall this is a better approach than the initial solution given by prog_guy. – istovatis Nov 21 '13 at 09:34
0

You are actually taking O(n^2 * time required for one intersect).

Let's see if we can improve the time for the intersect. We can maintain a map for every country which stores its products, so you have n hash maps for n countries; you only need to iterate through all the products once to initialize them. If you want quick lookup, maintain a map of maps:

    HashMap<String,HashMap<String,Boolean>> countryMap = new HashMap<String, HashMap<String,Boolean>>();

Now if you want to find the common products for countries str1 and str2 do:

    HashMap<String,Boolean> map1 = countryMap.get("str1");
    HashMap<String,Boolean> map2 = countryMap.get("str2");

    ArrayList<String> common = new ArrayList<String>();
    for (Map.Entry<String,Boolean> pair : map1.entrySet()) {
        // add the product to common if it is also present in the other map
        if (map2.containsKey(pair.getKey()))
            common.add(pair.getKey());
    }

So, in total it will be O(n^2 * k) if there are k entries per map, assuming the hash map lookup is O(1) (I guess it is log k for Java).

  • The complexity is O(n), yes lookups are O(1) in Java. Your looping construct can be simplified to `map1.keySet().retainAll(map2.keySet())`. – Thomas Jungblut Nov 21 '13 at 10:28
  • @ThomasJungblut There are O(n^2) combinations for n countries (nC2). If there are O(k) products per country, then overall complexity will be O(n^2 * k) since as you mentioned lookups are O(1) in java. –  Nov 21 '13 at 14:44
0

Using hashmaps where necessary to speed things up:

1) Go through the data and create a map whose keys are items and whose values are the lists of countries associated with each item. So e.g. Sheep: Australia, US, UK, New Zealand...

2) Create a hashmap with keys each pair of countries and (initially) an empty list as values.

3) For each item, retrieve the list of countries associated with it, and for each pair of countries within that list, add that item to the list created for that pair in step (2).

4) Now output the updated list for each pair of countries.

The largest costs are in steps (3) and (4), and both of these costs are linear in the amount of output produced, so I think this is not too far from optimal.
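Here is a minimal Java sketch of steps (1) to (4), assuming the input is already available as (country, item) pairs; the names and sample data are mine:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PairGrouping {
    public static void main(String[] args) {
        String[][] rows = { {"Australia", "Sheep"}, {"US", "Apple"},
                            {"Australia", "Apple"}, {"New Zealand", "Sheep"} };

        // Step 1: item -> set of countries associated with that item
        Map<String, Set<String>> countriesByItem = new HashMap<String, Set<String>>();
        for (String[] row : rows) {
            Set<String> cs = countriesByItem.get(row[1]);
            if (cs == null) countriesByItem.put(row[1], cs = new TreeSet<String>());
            cs.add(row[0]);
        }

        // Steps 2 and 3: (country pair) -> list of shared items
        Map<List<String>, List<String>> itemsByPair = new HashMap<List<String>, List<String>>();
        for (Map.Entry<String, Set<String>> e : countriesByItem.entrySet()) {
            List<String> cs = new ArrayList<String>(e.getValue());
            for (int i = 0; i < cs.size(); i++) {
                for (int j = i + 1; j < cs.size(); j++) {
                    List<String> pair = Arrays.asList(cs.get(i), cs.get(j));
                    List<String> items = itemsByPair.get(pair);
                    if (items == null) itemsByPair.put(pair, items = new ArrayList<String>());
                    items.add(e.getKey());
                }
            }
        }

        // Step 4: output the updated list for each pair of countries
        for (Map.Entry<List<String>, List<String>> e : itemsByPair.entrySet())
            System.out.println(e.getKey() + " : " + e.getValue());
    }
}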

mcdowella