I am running an analysis at a rather large scale (thousands of projects) in which I extract test framework usage from source code (e.g. detecting assertEquals to measure assert density). For this, I do not want to count any statements that have been commented out. To strip them, I have the following method:

public static CharSequence replaceAllRegexInFile(CharSequence input, String regex) {
    if (regex == null || input == null) {
        return input;
    }
    Pattern pattern = Pattern.compile(regex);
    return pattern.matcher(input).replaceAll("");
}

I am running this method with the following regex to remove Java comments:

(\/\*([\S\s]+?)\*\/|(?s)/\*.*?\*/)

I am well aware that replaceAll allocates a lot of intermediate results while building and returning the final result. I could resort to using replace, but that would not let me use a regex to match the comments.

I understand why the heap space error is thrown, especially since I am streaming all files of all projects concurrently across my entire machine. This certainly uses a lot of resources, but I have been unable to find an alternative, since the regex replacement is definitely a requirement.

Any suggestions would be greatly appreciated.

You can find the stacktrace below:

Exception in thread "main" java.lang.OutOfMemoryError
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
  at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
  at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
  at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
  at AnalysisRunner.startAnalysis(AnalysisRunner.java:33)
  at AnalysisRunner.main(AnalysisRunner.java:26) 
Caused by: java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3332)
  at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
  at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:541)
  at java.lang.StringBuffer.append(StringBuffer.java:350)
  at java.util.regex.Matcher.appendReplacement(Matcher.java:888)
  at java.util.regex.Matcher.replaceAll(Matcher.java:955)
  at Business.RegexService.replaceAllRegexInFile(RegexService.java:64)
  at Business.FrameWorkDetectionService.extractAllResultsForFile(FrameWorkDetectionService.java:58)
  at Business.FrameWorkDetectionService.lambda$extractFrameworkDependencies$0(FrameWorkDetectionService.java:39)
  at Business.FrameWorkDetectionService$$Lambda$19/1175339539.apply(Unknown Source)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
  at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
  at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
  at java.util.stream.AbstractTask.compute(AbstractTask.java:316)
  at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
  at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
  at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1870)
  at java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool.java:2045)
  at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:404)
  at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
  at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
  at Business.FrameWorkDetectionService.extractFrameworkDependencies(FrameWorkDetectionService.java:39)
  at Business.FrameWorkDetectionService.detectFrameworks(FrameWorkDetectionService.java:26)
  at Business.FrameworkService.projectResults(FrameworkService.java:59)
  at AnalysisRunner$$Lambda$13/1712669532.apply(Unknown Source)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)

Is there an alternative solution that will not allocate this much heap space that will still allow me to replace all comments in a lot of files concurrently?

Any help is greatly appreciated!

trincot
user1390504
    A usual issue in Java when you use inefficient regex patterns with alternations. Use the regex from [Regex to match a C-style multiline comment](http://stackoverflow.com/a/36328890/3832970). BTW, there is no point to duplicate patterns inside one pattern: `/\*([\S\s]+?)\*/` matches the same text as `(?s)/\*.*?\*/` – Wiktor Stribiżew Jan 12 '17 at 10:53
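To make the comment above concrete, here is a minimal sketch of an "unrolled" block-comment pattern of the kind such answers usually recommend. The exact pattern in the linked answer may differ, and the class and method names here are hypothetical. The idea is that each part of the pattern consumes a run of characters in one step instead of expanding a lazy quantifier one character at a time, and no alternation group is involved.

```java
import java.util.regex.Pattern;

final class UnrolledCommentStripper {
    // Assumed pattern (not taken verbatim from the question):
    //   /\*                 opening delimiter
    //   [^*]*\*+            anything up to and including a run of '*'
    //   (?:[^/*][^*]*\*+)*  if that run is not followed by '/', keep going
    //   /                   closing slash
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    /** Removes every block comment from the input (hypothetical helper). */
    static String strip(CharSequence input) {
        if (input == null) {
            return null;
        }
        return BLOCK_COMMENT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "a /* x ** y */ b /**/ c";
        System.out.println(strip(source));
    }
}
```

Like the regex in the question, this does not distinguish real comments from comment-like text inside string literals, so treat it as an approximation.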

2 Answers

Java is probably not allocating enough memory for your app. You can try to increase the initial and maximum heap size with the -Xms and -Xmx flags, for example:

java -Xmx2048m -Xms512m yourApp

Adjust these parameters until the application no longer crashes.

You can see all possible parameters by running java -X
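To confirm that the flags actually reached the JVM running the analysis, a small check like the following may help. This is only a sketch; the class name is hypothetical, and it uses nothing beyond the standard `Runtime` API.

```java
final class HeapSettingsCheck {
    /** Returns the maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024L * 1024L;
        // maxMemory reflects -Xmx; totalMemory is what is currently reserved.
        System.out.println("max heap   (MiB): " + rt.maxMemory() / mib);
        System.out.println("total heap (MiB): " + rt.totalMemory() / mib);
        System.out.println("free heap  (MiB): " + rt.freeMemory() / mib);
    }
}
```

Running it with the same flags as the real application (for example `java -Xmx2048m HeapSettingsCheck`) should report a maximum close to the requested value.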

If changing the allocated memory does not help, try creating a heap dump with jmap -dump:format=b,file=heap.hprof <process-id> while your application is running. Then open it in a memory analyzer (for example http://www.eclipse.org/mat/). There may be memory leaks in other parts of the code, and the analyzer can help you find them.

Janothan

I think this is more of a long comment than an answer, but I am posting it as an answer since that allows richer formatting.

Your regex does not perform well, which might be contributing to such a big memory error. For instance, this is the diagram of your regex:

[image: diagram of the original regex]

What I understand from this is that you just want to get rid of block comments. There are several problems in your regex. The most important one is that it contains two different patterns that do exactly the same thing, so you should use only one of them. That lets you drop the capturing groups and the alternation and use just one of these:

\/\*[\S\s]+?\*\/   <--- I removed the capturing group to make it more efficient, since you didn't need it
or 
(?s)/\*.*?\*/

[image: diagram of the simplified regex]

As you can see, this pattern is much simpler: it has neither the two duplicate patterns, nor the two capturing groups, nor the alternation, all of which are expensive.
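As a rough sketch of how the simplified pattern could be plugged into the method from the question: the version below compiles the pattern once and reuses it, instead of calling Pattern.compile for every file. The class and method names are hypothetical. `Pattern` instances are immutable and safe to share between threads, so a single constant also works with a parallel stream.

```java
import java.util.regex.Pattern;

final class BlockCommentStripper {
    // Compiled once; (?s) lets '.' match line terminators as well.
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("(?s)/\\*.*?\\*/");

    /** Returns the input with all block comments removed. */
    static String stripBlockComments(CharSequence input) {
        if (input == null) {
            return null;
        }
        // A new Matcher is created per call; only the Pattern is shared.
        return BLOCK_COMMENT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "int a = 1; /* old\n assertEquals(1, a); */ int b = 2;";
        System.out.println(stripBlockComments(source));
    }
}
```

This only reduces the work done per call; each call still builds one result string roughly the size of the input, so the number of files held in memory at the same time still matters.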

Anyway, if you don't need Java, I think there are much better tools for this kind of replacement, such as sed with the -i flag (edit in place).

However, if you still want to keep your regex, you can improve it by removing the unneeded inner capturing group and turning the outer one into a non-capturing group, like this:

(?:\/\*[\S\s]+?\*\/|(?s)/\*.*?\*/)
Federico Piazza