
I'm looking for a way to create a regular expression that is 100% equivalent to the "contains" method in the String class. I have thousands of phrases to search for, and from what I understand it is much better for performance to compile a regular expression once and reuse it than to call "mystring.contains(testString)" over and over on different "mystring" values with the same testString values.

Edit: to expand on my question: I will have many thousands of "testString" values, and I don't want to have to convert them to a format that the regular expression engine understands. I just want to pass in a phrase exactly as a user entered it and see whether it is found in whatever value "mystring" happens to contain. Each "testString" will never change its value, but there will be thousands of them, which is why I was thinking of creating the matcher object once and reusing it. (Obviously my regexp skills are not up to snuff.)

user85116
  • Did you try looking at the implementation of String.contains in the JDK? – Woot4Moo Oct 21 '11 at 14:22
  • Unless you know what you are looking for before hand, it would not be possible to pre-compile a regular expression to match possible terms. Hence the flexible implementation of `contains()`. – Jason McCreary Oct 21 '11 at 14:23
  • possible duplicate of [Is the Contains Method in java.lang.String Case-sensitive?](http://stackoverflow.com/questions/86780/is-the-contains-method-in-java-lang-string-case-sensitive) – Woot4Moo Oct 21 '11 at 14:23
  • Because there is not an option to merge answers/questions I have voted to mark this as a dupe. If you look at the accepted answer in my dupe link this should be sufficient. – Woot4Moo Oct 21 '11 at 14:24

2 Answers


You can use the LITERAL flag when compiling your pattern to tell the engine you're using a literal string, e.g.:

 Pattern p = Pattern.compile(yourString, Pattern.LITERAL);

But are you really sure that doing that and then reusing the result is faster than just String#contains? Enough to make the complexity worth it?
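As a minimal sketch of how the LITERAL approach might be wired up: the class and method names below (`LiteralContains`, `compileLiteral`, `containsPhrase`) are hypothetical and only for illustration. The key point is that `find()` is used rather than `matches()`, since `find()` searches for the phrase anywhere in the input, which is what `contains` does.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the JDK.
final class LiteralContains {

    // Compile once per phrase. With Pattern.LITERAL, characters such as
    // '.', '(' and '|' are matched as themselves, not as regex syntax.
    static Pattern compileLiteral(String phrase) {
        return Pattern.compile(phrase, Pattern.LITERAL);
    }

    // find() scans for the pattern anywhere in the input, like String.contains.
    static boolean containsPhrase(Pattern literal, String input) {
        return literal.matcher(input).find();
    }

    public static void main(String[] args) {
        Pattern p = compileLiteral("a.b (c)");
        System.out.println(containsPhrase(p, "xx a.b (c) yy")); // true
        System.out.println(containsPhrase(p, "xx aXb (c) yy")); // false: '.' is literal
    }
}
```

Whether this is actually faster than calling `contains` directly is something only a measurement on your own data can tell you.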

T.J. Crowder

Well, you could use Pattern.quote to get a "piece of regular expression" for each input string. Do any of your terms contain line breaks? If so, that could make life slightly trickier, though far from impossible.

Anyway, you'd basically just join the quoted terms together as:

Pattern pattern = Pattern.compile("quoted1|quoted2|quoted3|...");

You might want to use Guava's Joiner to easily join the quoted strings together, although obviously it's not terribly hard to do manually.
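For illustration, here is a rough sketch of the quote-and-join idea. It uses the JDK's `Collectors.joining` (Java 8+) in place of Guava's Joiner, and the names `AnyPhraseMatcher`, `compileAny`, and `containsAny` are made up for the example. Note that the combined pattern only answers "does the input contain at least one of the phrases"; it does not tell you which one without extra work.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the JDK.
final class AnyPhraseMatcher {

    // Quote each phrase so it is matched literally, then join the quoted
    // pieces with '|' to form one alternation. Assumes a non-empty list:
    // an empty list would yield an empty pattern, which matches everything.
    static Pattern compileAny(List<String> phrases) {
        String alternation = phrases.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // True if at least one of the phrases occurs somewhere in the input.
    static boolean containsAny(Pattern anyPhrase, String input) {
        return anyPhrase.matcher(input).find();
    }

    public static void main(String[] args) {
        Pattern p = compileAny(Arrays.asList("a.b", "1+1", "foo|bar"));
        System.out.println(containsAny(p, "what is 1+1?")); // true
        System.out.println(containsAny(p, "foo or bar"));   // false: '|' is literal
    }
}
```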

However, I would try this and then test whether it's actually more efficient than just calling contains. Have you already got a benchmark which shows that contains is too slow?

Jon Skeet