
In my project, we have a requirement to read a very large file in which each line holds identifiers separated by a special character ("|"). Unfortunately I can't use parallelism, since a validation between the last character of one line and the first character of the next line is needed to decide whether or not that line will be extracted. Anyway, the requirement is very simple: break the line into tokens, analyze them, and store only some of them in memory. The code is very simple, something like below:

    final LineIterator iterator = FileUtils.lineIterator(file);
    while (iterator.hasNext()) {
        final String[] tokens = iterator.nextLine().split("\\|");
        // process
    }

But this little piece of code is very, very inefficient. The split() method generates too many temporary objects that are not being collected (as best explained here: http://chrononsystems.com/blog/hidden-evils-of-javas-stringsplit-and-stringr).

For comparison purposes: a 5 MB file was using around 35 MB of memory by the end of processing.

I tested some alternatives (such as a precompiled Pattern), but none of them appeared to be efficient enough. Using JProfiler, I could see that the amount of memory used by temporary objects was too high (35 MB used, but only 15 MB actually held by live objects).

Then I decided to run a simple test: after every 50,000 lines read, call System.gc() explicitly. At the end of the process, memory usage had decreased from 35 MB to 16 MB. I tested many, many times and always got the same result.

I know that invoking System.gc() is bad practice (as indicated in Why is it bad practice to call System.gc()?). But is there any other alternative for a scenario where the split() method could be invoked millions of times?

[UPDATE] I use a 5 MB file only for testing purposes, but the system should process much larger files (500 MB ~ 1 GB).

DanielSP
    *"The method split() generates too many temporary objects that are not been collected (as best explained here: http://chrononsystems.com/blog/hidden-evils-of-javas-stringsplit-and-stringr ."* Too bad it doesn't explain what you claim. It is also unclear why you want to split your String, instead of parsing it. – Tom May 03 '16 at 14:43
  • What is the criterion on which you accept or reject the elements of `tokens`? – Andy Turner May 03 '16 at 14:45
  • The obvious other solution is to not split the string, but to scan/parse/process the string in-situ (a sketch of this idea appears after these comments). – Mark Rotteveel May 03 '16 at 14:50
  • Even if 35 MB are used, does it really matter? If your JVM doesn't have that much memory, it will try to collect in between anyway; if it does, then why bother? In the end it will collect eventually. – Thomas May 03 '16 at 14:52
  • The difference between 35 MB and 16 MB is worth about 10 cents. How much is your time worth, trying to save 10 cents of memory? On minimum wage that is about 1 minute. In general, don't call System.gc(); let the JVM do it when it needs to. – Peter Lawrey May 03 '16 at 14:56
  • **@Tom** and **@Andy**: I use it because the system should extract from each line an array of tokens, to check the total number found and the position of each one in the array. For example, if I find a token with '0' at position 5 in the String[], then the value '1' should be at position 23. In this case, it indicates that this line contains sensitive data to be extracted. If the String[] has fewer than 20 tokens, then I should check other positions. – DanielSP May 03 '16 at 15:20
  • **@PeterLawrey** and **@Thomas**, my system should process very large files (around 500 MB to 1 GB). What worries me is that the amount of temporary objects generated could cause an OutOfMemoryError at runtime. – DanielSP May 03 '16 at 15:26
  • Have you tried a compiled pattern + [splitAsStream](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#splitAsStream-java.lang.CharSequence-)? And if you want to run this on gigabytes of data, then you should probably test with a realistic test data set. – the8472 May 03 '16 at 15:36
  • **@the8472**: I tested with a huge file, but it takes a lot of time and results in "GC overhead limit exceeded". – DanielSP May 03 '16 at 17:24
  • Would something like https://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html suit your needs? It indeed reads through a file character by character, but at least it doesn't load everything in at once. So processing is longer, but at least it won't be memory intensive. – Bartvbl May 04 '16 at 09:16
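
For reference, here is a rough sketch of the in-situ scanning idea from the comments above (isWanted and keep are hypothetical placeholders standing in for the position checks described by the asker): walk the delimiters with indexOf and only pay for a substring when a token actually has to be kept.

    public class InSituScan {

        static void processLine(String line) {
            int position = 0;                       // token index within the line
            int start = 0;
            while (true) {
                int end = line.indexOf('|', start);
                boolean last = (end == -1);
                if (last) {
                    end = line.length();
                }
                // examine the region [start, end) directly; only allocate a
                // substring for tokens that must actually be stored
                if (isWanted(line, start, end, position)) {
                    keep(line.substring(start, end));
                }
                if (last) {
                    break;
                }
                start = end + 1;
                position++;
            }
        }

        static boolean isWanted(String line, int start, int end, int position) {
            return false; // placeholder for the real position/content checks
        }

        static void keep(String token) {
            // placeholder: store the extracted token
        }
    }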

2 Answers


The first and most important thing to say here is: don't worry about it. The JVM is consuming 35 MB of RAM because its configuration says that's an acceptable amount. When its highly efficient GC algorithm decides it's time, it will sweep all those objects away, no problem.

If you really want to, you can invoke Java with memory management options (e.g. java -Xmx64m to cap the heap at 64 MB) -- I suggest it's not worth doing unless you're running on very limited hardware.

However, if you really want to avoid allocating an array of String each time you process a line, there are many ways to do so.

One way is to use a StringTokenizer:

    StringTokenizer st = new StringTokenizer(line, "|");

    while (st.hasMoreTokens()) {
        process(st.nextToken());   // nextToken() returns a String, unlike nextElement()
    }

You could also avoid consuming a whole line at a time: open your file as a stream, use a StreamTokenizer, and consume one token at a time.
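
A minimal sketch of that approach (the file name and process method are placeholders; note that StreamTokenizer still allocates one String per token in sval, so the saving lies in avoiding the per-line String[] and the regex machinery):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StreamTokenizer;

    public class PipeTokens {

        public static void main(String[] args) throws IOException {
            try (Reader reader = new BufferedReader(new FileReader("data.txt"))) {
                StreamTokenizer st = new StreamTokenizer(reader);
                st.resetSyntax();                // start from an empty syntax table
                st.wordChars(0, 255);            // every character belongs to a token...
                st.whitespaceChars('|', '|');    // ...except the delimiter
                st.whitespaceChars('\n', '\n');  // ...and line terminators
                st.whitespaceChars('\r', '\r');
                st.eolIsSignificant(true);       // report line ends as TT_EOL
                while (st.nextToken() != StreamTokenizer.TT_EOF) {
                    if (st.ttype == StreamTokenizer.TT_EOL) {
                        // line boundary: the cross-line validation would hook in here
                    } else if (st.ttype == StreamTokenizer.TT_WORD) {
                        process(st.sval);        // one token at a time
                    }
                }
            }
        }

        static void process(String token) { /* placeholder */ }
    }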

Read the API docs for Scanner, BufferedInputStream, Reader -- there are lots of choices in this area, because you're doing something fundamental.
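
For example, a Scanner can hand you tokens directly, with no String[] per line. A rough sketch (the file name and process method are placeholders; note that this delimiter also swallows line breaks, so the cross-line validation from the question would need extra bookkeeping):

    import java.io.File;
    import java.io.IOException;
    import java.util.Scanner;

    public class ScannerTokens {

        public static void main(String[] args) throws IOException {
            try (Scanner sc = new Scanner(new File("data.txt"))) {
                sc.useDelimiter("\\||\\R");  // a '|' or any line terminator ends a token
                while (sc.hasNext()) {
                    process(sc.next());      // one token at a time
                }
            }
        }

        static void process(String token) { /* placeholder */ }
    }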

However, none of these will cause Java to GC sooner or more aggressively. If the JRE doesn't consider itself short of memory, it won't collect any garbage.

Try writing something like this:

    import java.util.Random;

    public class AllocationDemo {
        public static void main(String[] args) {
            Random r = new Random();
            Integer x;
            while (true) {
                // each iteration allocates a 'temporary' Integer (most values are uncached)
                x = Integer.valueOf(r.nextInt());
            }
        }
    }

Run it and watch your JVM's heap size as it runs (put a sleep in if the usage shoots up too quickly to see). Each time around the loop, Java creates what you call a 'temporary object' of type Integer. All of these stay in the heap until the GC decides it needs to clear them away. You'll see that it won't do this until it reaches a certain level. But when it reaches that level, it will do a good job of ensuring that its limits are never exceeded.

slim

You should adjust your way of analyzing situations. While the article about the regex compilation under the hood is correct in general, it doesn’t apply here. When you look at the source code of String.split(String), you’ll see that it just delegates to String.split(String,int) which has a special code path for patterns consisting of just one literal character, including escaped ones like your \|.

The only temporary object created within that code path is an ArrayList. The regex package is not involved at all; this fact might help you understand why precompiling a regex pattern did not improve the performance here.
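
To illustrate, that fast path boils down to roughly the following (a simplified paraphrase of the OpenJDK code, not the verbatim source; the real method also drops trailing empty strings when the limit is zero):

    import java.util.ArrayList;

    public class FastSplitSketch {

        // roughly what String.split does for a one-char literal pattern like "\\|"
        static String[] fastSplit(String input, char delimiter) {
            ArrayList<String> list = new ArrayList<>();
            int off = 0;
            int next;
            while ((next = input.indexOf(delimiter, off)) != -1) {
                list.add(input.substring(off, next));
                off = next + 1;
            }
            if (off == 0) {
                return new String[] { input };  // no delimiter found at all
            }
            list.add(input.substring(off));     // the remaining tail
            return list.toArray(new String[0]);
        }
    }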

When you use a profiler to come to the conclusion that there are too many objects, you should also use it to find out what kinds of objects they are and where they originate, instead of guessing wildly.

But it's not clear why you are complaining at all. You can configure the JVM to use a certain maximum amount of memory. As long as that maximum has not been reached, the JVM just does what you told it to, using that memory rather than wasting CPU cycles trying to avoid using the memory that is available. Where's the sense in not using the available memory?

Holger
  • Thanks for your answer **Holger**, but in fact, when I check the list of objects available in the profiler, there are a lot of Object[] instances growing exactly after each split() call. I test with small files because it's fast (I can run 20 tests and get the average time and memory usage). Another interesting thing: in this case, I know that 32 MB should be enough to process the file, but when I run the test with -Xms16m -Xmx32m, it results in "GC overhead limit exceeded". – DanielSP May 03 '16 at 17:29
  • `ArrayList` encapsulates an `Object[]` instance, so that's not surprising. Getting the "GC overhead limit exceeded" error just proves that trying to limit the memory unnecessarily will degrade performance, as that error means exactly that: too much time has been spent in garbage collection. See http://stackoverflow.com/q/1393486/2711488 – Holger May 03 '16 at 17:48