3

I have two very big lists to compare. I compared them using retainAll() method and got the list of common elements. But I want to get the similar matches as well.

ArrayList<String> list1 = new ArrayList<String>(Arrays.asList("John","Mary"," Mr. John Marsh","Mrs. Mary Dsouza","abc","xyz"));
ArrayList<String> list2 = new ArrayList<String>(Arrays.asList("John","Mary","Tim","Sam"));
list1.retainAll( list2 );
System.out.println( list1 );

this gives me output [John, Mary]

I want similar matches as well like [John, Mary, Mr. John Marsh, Mrs. Mary Dsouza]

How to proceed? Just an idea will be sufficient.

Shashank
  • 712
  • 15
  • 33
  • Your sample result set given above has the flavor of a graph, since one element in the first list may be connected to a long chain of elements. – Tim Biegeleisen May 29 '15 at 03:16
  • so should I look for graphs algorithms? Any specific algo? – Shashank May 29 '15 at 03:22
  • I was just saying that for each entry in every list, you will have to compare against each entry in the other list and then reconcile that with the running groups you form. Tough problem. – Tim Biegeleisen May 29 '15 at 03:27

3 Answers3

3

Okay, Although i am scared to post this answer as i think it is very crude but still i will go ahead and post it. Fingers crossed :).

retainAll uses equals internally and since string is a final class we cannot manipulate it but we can create a wrapper around it and provide a custom equals implementation. But this adds to the space complexity.

Here is what i did (used contains in equals method).

public class FindAlike{


public static void main(String[] args) {
    ArrayList<StringWrapper> list1 = new ArrayList<StringWrapper>(Arrays.asList(new StringWrapper("John"),new StringWrapper("Mary")
    ,new StringWrapper(" Mr. John Marsh"),new StringWrapper("Mrs. Mary Dsouza"),new StringWrapper("abc"),new StringWrapper("xyz")));
    ArrayList<StringWrapper> list2 = new ArrayList<StringWrapper>(Arrays.asList(new StringWrapper("John"),new StringWrapper("Mary"),
            new StringWrapper("Tim"),new StringWrapper("Sam")));
    list1.retainAll( list2 );
    System.out.println( list1 );
}

private static class StringWrapper{

    private String value;

    public StringWrapper(String value) {
        this.value = value;
    }

    public String getValue(){
        return this.value;
    }

    @Override
    public boolean equals(Object obj) { 
        return this.value.contains(((StringWrapper)obj).getValue());
    }

    @Override
    public String toString() {
        return this.value;
    }

}
}

For the given data i got the following output - [John, Mary, Mr. John Marsh, Mrs. Mary Dsouza]

Ouney
  • 1,164
  • 1
  • 10
  • 22
  • Also, equals implementation is very basic you need to have many other checks in it. – Ouney May 29 '15 at 03:31
  • This is one way solution. If second `List` contain a value like `Tom abc`. Than it does not work. – Masudul May 29 '15 at 03:34
  • The implementation will vary as per OP's requirement. I mean you can modify contains like - this.value.contains(((StringWrapper)obj).getValue()) || ((StringWrapper)obj).getValue().contains(this.value) OR call list2.retainAll(list1) and then combine all the results. I just gave an idea and as i mentioned i am bit perplexed too :) – Ouney May 29 '15 at 03:42
0

Try This

for(String s1 : list1)
{
    for (String s2: list2)
    {
       if(s1.equals(s2) || s1.contains(s2) || s2.contains(s1))
       {
           list3.add(s1);
       }
    }

}

list3 gives you the elements which you need.

0

I guess you don't want to perform any symantic analysis on those strings. If its just a string comparision check this post and analyze those similarity algorithms.

I highlight those algorithms below (in case if that post is dead)

  • Cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity

I don't think you can reduce the number of iterations as it will always (should) be list1.length * list2.lenght. The only area that you can optimize is where you check for similarity. Also i would like to note that regex and contains operations are costly. So see if you can use one of the above algorithm in that place.

Please update us if you've come up with a better solution. Cheers!!

Community
  • 1
  • 1
Vivek
  • 3,523
  • 2
  • 25
  • 41