1

I am comparing two lists of strings to find possible matches. Example:

public class Tester {

    public static void main(String[] args) {

        List<String> test = new ArrayList<String>();
        List<String> test2 = new ArrayList<String>();

        test.add("3H0875AAAA0012");
        test.add("3H0875AABB0018");
        test.add("3H0875AAAC0010");
        test2.add("3H0875AA");


        for(String s2: test2){
            for (String s: test){
                if (s.matches(".*" + s2 + ".*")){
                    System.out.println("Match");
                }
            }
        }
    }
}

Basically for every string in test2 I want to see if there are any strings in test that contain test2 completely or partially. The output for the above code should be:

Match 
Match 
Match

However, in my real case scenario I have around 225K strings in test and around 5K strings in test2. It is taking too long process this comparison and wanted to see if it was possible to optimize the comparison. It takes about 10 minutes to analyze the first 1.5K items in test2. So it will take at least 30 to 40 minutes to finish the comparison.

Thanks in advance

Wilo Maldonado
  • 551
  • 2
  • 11
  • 22

4 Answers4

3

I think that you shouldn't use regex for that: I believe that looking into String#contains (here is a link to its javadoc entry) would give you better results, in terms of performance ;)

For example, your code could be:

for(final String s2: test2){
    for (final String s: test){
        if(s.contains(s2)) {
            System.out.println("Match");
        }
    }
}
ccjmne
  • 9,333
  • 3
  • 47
  • 62
1

IMHO methods like String.matches(String) should be forbidden. Maybe you need a regex match, maybe not, but what happens here, is that you string gets compiled into an regex... again and again.

So do yourself a favor and convert then all into regexes via Pattern.compile and reuse them.


Looking at your ".*" + s2 + ".*", I'd bet you need no regex at all. Simply use String.contains and enjoy the speed.

maaartinus
  • 44,714
  • 32
  • 161
  • 320
1

Instead of

s.matches(".*" + s2 + ".*")

you can use

s.contains(s2)

or

s.indexOf(s2) > -1

I tested both, each is about 35x faster than matches.

Andrey
  • 2,503
  • 3
  • 30
  • 39
0

You absolutely should be creating a single Matcher object in this situation, and using that single object in every loop iteration. You are currently creating a new matcher (and compiling a new Pattern) in each loop iteration.

At the top of your code, do this:

//"": Unused to-search string, so the matcher object can be reused
Matcher mtchr = Pattern.compile(".*" + s2 + ".*").matcher("");

Then in your loop, do this:

if(mtchr.reset(s).matches())  {
   ...

But I'll agree with @maaartinus here, and say that, given your requirements, you don't need regex at all, and can instead use indexOf(s), or even better, contains(s), as you don't seem to need the resulting index.

Regardless, this concept of reusing a matcher is invaluable.

aliteralmind
  • 19,847
  • 17
  • 77
  • 108