3

I have a csv file with names nearly 845k line.

I want to compare fuzzy name string matching. I used Java fuzzy string matching implementation of the well known Python's fuzzywuzzy algorithm.

Implemented below code it works perfect for me. The Problem is process time to much. Every line compare time is nearly 15 sec with other lines. This is 240 line for an hour and whole process will be nearly 6000 row. And all process will be finish in months. This is unacceptable working time.

I need an optimization technique or method. I need some suggestion rather than solution.

What you suggest for below code?

BufferedReader br = new BufferedReader(new FileReader("data/names.csv"));
BufferedWriter bw = new BufferedWriter(new FileWriter("data/similars.csv"));
ConcurrentHashMap<Integer,String> map = new ConcurrentHashMap<Integer,String>();

String lines;
while( (lines = br.readLine()) != null ){
    String[] line = lines.split("\\t",-1);
    Integer nameId = Integer.parseInt(line[0]);
    String name = line[1];
    map.put(nameId, name);
}

for (Map.Entry<Integer, String> entry1 : map.entrySet()) {
    Integer nameId1 = entry1.getKey();
    String name1 = entry1.getValue();

    for (Map.Entry<Integer, String> entry2 : map.entrySet()) {
        Integer nameId2 = entry2.getKey();
        if (nameId1 == nameId2) {
            continue;
        }
        String name2 = entry2.getValue();
        int ratio = FuzzySearch.ratio(name1,name2);
        if(ratio > 95){
            bw.write(nameId1 + "," + nameId2 + "\n");
        }
    }
    // For to prevent matching same pairs again 
    map.remove(nameId1);
}
Yilmazerhakan
  • 1,765
  • 1
  • 13
  • 21
  • 1
    How about just run this on several CPU or several servers in AWS? If I am right it should take about 3 days on 24 cores: ((845000*15/2)/60/60/24)/24 ~ 3.05 days. I think this is acceptable because you should do it once. – Maxim Jan 06 '17 at 22:57
  • @MaximDobryakov İt is my desktop pc with i7 cpu and 16 gb ram.win 10 os. – Yilmazerhakan Jan 06 '17 at 23:05

1 Answers1

3
  1. You can try instead Levenshtein distance algorithm, maybe it will give you better performance. Or try any other algorithm. Provide a link to algoritm implementation
  2. It's better not to compare Integer with ==, use nameId1.intValue() == nameId2
  3. Create N threads, where N is the number of cores. Put all your lines in the ConcurrentLinkedQueue. Let you threads poll the queue, take a word, make compassion, once finished - write to file under synchronized section. It will allow you to use all your cores on your PC, not only 1
  4. Why it takes so much time, maybe you have some memory restriction, which force GC to eat your CPU cycles and affect performance.
  5. You can apply some minor optimization, let's say if different between words length is more then 50%, you will never get 95% match
  6. Take a look at this implementation they are using threshold value, I believe it will give you max boost, I think algoritm will return earlier if distance is greater than threshold. Also check this question
Community
  • 1
  • 1
Anton
  • 5,831
  • 3
  • 35
  • 45
  • I would use more threads than cores, because two threads can still run on the same core. – NickL Jan 06 '17 at 23:09
  • Thank you Anton. I will try 2 and 5 quickly. For 1, I edited and linked fuzzywuzzy github library. It also based on levensthein and have some varieties like out of order words compare. I couldnt understand 2nd. sorry i rarely use java but NameIds also integer. i will search, learn and try 3. For 4 it is my desktop and gave properties in above answer's comments. – Yilmazerhakan Jan 06 '17 at 23:20
  • 2) is about that you shouldn't compare objects with ==, by calling intValue() on the Integer object, you compare the primitives. However, in this case because we are talking about the key of a hashmap, I think it is safe (and fastest even maybe?) to use == on the Integer objects. – NickL Jan 06 '17 at 23:28
  • @NickL 2) well if you compare two ints like this Integer a = 128; Integer b = 128; you will get false, if you comapre a = 16, b =16, you will get true. I would say no matter how you Integer are created just always compare it as primitive as a rule of thumb to avoid futher errors. Also you don't need to cast second Integer to int if you don't want too – Anton Jan 06 '17 at 23:38
  • I agree with you in general, but in this case it is different. The reason for the comparison is to not check the same entry in the hashmap to itself, using == is fine then. Two keys in a HashMap that are equal but not identical is by definition impossible (iff hashCode and equals are implemented correctly, which is the case for Integer.). – NickL Jan 06 '17 at 23:49
  • Also, un-boxing is not really good for performance.. (I.e. int == Integer). If you are 100% sure it is safe to check for reference equality, and performance is an issue, equality is the way to go imo. – NickL Jan 06 '17 at 23:58
  • Well in the example above comparing 240^2 integers will take less then 1 msec, the OP has other issues with perfomance ) – Anton Jan 06 '17 at 23:59
  • When it is done with 240 lines, it will have done that comparson 240*845.000 times, thats a bit more than 240^2. But yes, I agree, the whole problem here is the O(n^2) complexity I think. – NickL Jan 07 '17 at 00:07
  • @NickL i understood 2) with your comments. I had to fix a confuse about lines. İt has 845k lines. So initial scenario is 845k * 845k compare but i delete each map key after each outer loop. So complexity is O(n*logn) i think. – Yilmazerhakan Jan 07 '17 at 05:00
  • @Anton thanks for 6). A i liked threshold idea. Comparing distance much costed lines. My lines are not with same word count. So they may be out of orders wrong names, speelling error names, typo names. So used this library for these variaties. – Yilmazerhakan Jan 07 '17 at 05:36
  • @Yilmazerhakan n*logn is near-linear. I think by removing the line after comparison, you are deviding the work by 2. But O(0.5 * n^2) still grows quadratically, so it is still O(n^2), meaning that it will take a long time to complete for 850k lines.. which is the case. – NickL Jan 07 '17 at 11:19