2

I'm working on a regular expression matching function. The problem is that the framework will call this function inside a nested loop. If temporary objects are created, GC will cause a very big performance problem.

Is it possible to do regexp matching without creating temporary objects (Pattern, Matcher)? Rewriting the regexp classes is my last resort...

Michael Berry
user1192878
  • 1) GC should NOT cause a performance problem, ever! 2) You cannot assume that GC will not run during your loop just because you are not using temporary objects. 3) If you want the object to persist, then you should see http://stackoverflow.com/questions/1329926/how-to-prevent-an-object-from-getting-garbage-collected – Ahmed Masud Feb 06 '12 at 17:50
  • @Ahmed-Masud : well, technically this is not correct, since the main (full) GC cycle is dependent on the amount of memory allocated to the JVM; more memory means more time for the cycle to finish, which has some effect on the system – BigFatBaby Feb 06 '12 at 18:00
  • @BigFatBaby I agree that the amount and chunks of memory allocated will cause the GC thread to work harder (or not). My point in 1) was that either the system is powerful enough for the app or it isn't, and that GC should be tuned at launch time. If the iron is out of resources then any approach is fraught with peril anyhow. Whether one holds refs to all objects to avoid GC (mem hog), or writes external methods (running locally) using JNI and completely bypasses GC, one is still limited by the system resources available. – Ahmed Masud Feb 06 '12 at 18:28
  • @Ahmed-Masud : the rule of thumb is: if GC is causing you trouble, you are doing it wrong... I agree... however, I am here to help with a problem (asked specifically)... for common practices there are other sites on Stack Exchange :) – BigFatBaby Feb 06 '12 at 22:37

6 Answers

1

Your best bet is to deal with the issues as and when they arise, which they probably won't. Performance problems from garbage collecting large numbers of short-lived objects were a concern around a decade ago, but the garbage collector is now incredibly good at handling them.

If you do need to optimise, do it by tuning the GC options (the size of the young generation, for instance) rather than by trying to optimise in code.

Michael Berry
  • It is an extreme case in which a huge amount of data has to be processed. Tons of small objects cause trouble. We tried GC options but couldn't solve all the problems. – user1192878 Feb 06 '12 at 23:10
1

Matcher objects are not thread-safe, so a single Matcher cannot be shared between threads. Within a single thread, however, you can reuse one by calling its reset() method, which should work fine. See "Is Java Regex Thread Safe?"
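As a minimal sketch of that single-threaded reuse: the class name `MatcherReuse`, the method `countMatches`, and the sample pattern are hypothetical and only for illustration. The pattern is compiled once, one Matcher is created up front, and `reset(CharSequence)` points it at each new input. Whether this measurably reduces GC pressure in a given workload is something to verify by profiling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherReuse {
    // Compiled once; a Pattern is immutable and can be shared freely.
    private static final Pattern PATTERN = Pattern.compile("a.*b");

    /** Counts the inputs that match the whole pattern, using one Matcher. */
    public static int countMatches(List<String> inputs) {
        // One Matcher for the whole loop; the empty string is a placeholder input.
        Matcher matcher = PATTERN.matcher("");
        int count = 0;
        for (String input : inputs) {
            // reset(CharSequence) re-targets the existing Matcher at the new input.
            if (matcher.reset(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(Arrays.asList("axxb", "ab", "ba")));
    }
}
```

The Matcher here is a local variable, so it never leaves the calling thread; storing it in a shared field would reintroduce the thread-safety problem.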

Community
Adam Rofer
1

To quote an old saying:

Make it work, make it right, make it fast. (in that order)

So before attempting any heavy optimization, just write the straightforward, appropriate code (which in this case would involve pre-compiling your patterns if you can). Run some tests to see whether the performance is inadequate, and optimize only if the regex portion turns out to be a bottleneck.

If the object creation (and cleanup) is a serious bottleneck (as compared to the actual regex parsing itself), then you may need to implement your own solution that uses an object pool (so objects are not created, just reset and reused from the pool). I doubt that this will result in any serious performance gains though, so you should benchmark first just to see how much gain is even possible (if you improve object creation / cleanup performance by 50%, would it be worth it?).
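One lightweight way to approximate such a pool, sketched here under assumptions: instead of a general-purpose pool, keep one reusable Matcher per thread in a `ThreadLocal`. The class name `PooledMatcher` and the sample pattern are hypothetical, and this is not necessarily what the answer had in mind, just one possible shape of "reset and reuse" that respects Matcher's lack of thread safety.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PooledMatcher {
    // Compiled once and shared; Pattern instances are immutable.
    private static final Pattern PATTERN = Pattern.compile("a.*b");

    // One Matcher per thread, created lazily the first time a thread asks for it.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    /** Returns true if the whole input matches, reusing this thread's Matcher. */
    public static boolean matches(CharSequence input) {
        // reset(CharSequence) re-targets the cached Matcher instead of allocating a new one.
        return MATCHER.get().reset(input).matches();
    }
}
```

A call such as `PooledMatcher.matches(subjectString)` can then sit inside the nested loop. The `ThreadLocal` lookup has its own cost, so benchmarking this against the plain `regex.matcher(...)` approach is still the deciding step.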

Trevor Freeman
  • Thanks. My colleague ran into GC-related issues in their previous work, so team members are required to avoid them. But you are right. I will make it work first and then test. – user1192878 Feb 06 '12 at 23:06
0

This sounds like premature optimization.

Write the most straightforward code you can, then profile it in a realistic setting, and see whether there are any problems with performance or memory allocation patterns. If there are, address the specific issues you've uncovered.

Modern JVMs are incredibly good at garbage collecting short-lived objects.

NPE
0

You can precompile your regexes, which makes sense if you reuse the same regex multiple times.

Instead of

boolean foundMatch = subjectString.matches("a.*b");

(where a temporary compiled Pattern will be created anyway), you can use

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile once, outside the loop
Pattern regex = Pattern.compile("a.*b");
// loop here
// do something...
    Matcher regexMatcher = regex.matcher(subjectString);
    boolean foundMatch = regexMatcher.matches();
// loop end

Hard to say if there will be any relevant performance benefit, though.

Tim Pietzcker
0

The way I see it, you have two viable options:

  1. Write your own regex matching logic, using the source code of Pattern and Matcher as a reference.
  2. Explicitly initiate collection of those objects when you are done with them by running their corresponding finalize() method, instead of waiting for the GC to run it.

Pros and Cons

  1. A lot of work, and the code needs to be tested and maintained in the future; however, you get full control over what you are trying to do.

  2. A clean and simple solution, but interfering with the workings of the GC is not recommended.

BigFatBaby