In my project, we have a requirement to read a very large file in which each line holds identifiers separated by a special character ("|"). Unfortunately, I can't use parallelism, since I need to validate the last character of each line against the first character of the next line to decide whether or not it will be extracted. Anyway, the requirement is very simple: break each line into tokens, analyze them, and store only some of them in memory. The code is very simple, something like below:
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

final LineIterator iterator = FileUtils.lineIterator(file);
while (iterator.hasNext()) {
    final String[] tokens = iterator.nextLine().split("\\|");
    // process the tokens
}
But this little piece of code is very, very inefficient. The split() method generates too many temporary objects that are not being collected (as best explained here: http://chrononsystems.com/blog/hidden-evils-of-javas-stringsplit-and-stringr).
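To make the cost concrete, my (simplified) understanding is that every call allocates a fresh result array plus a new String per token, even for tokens I immediately discard, and on some JDK versions those substrings retain the whole line's backing char[]:

// Illustration of the per-line allocations (my understanding, simplified)
final String line = "id1|id2|id3|id4";
final String[] tokens = line.split("\\|"); // 1 array + 4 short-lived Strings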
For comparison purposes: processing a 5 MB file was using around 35 MB of memory by the end of the run.
I tested some alternatives (the first two are sketched after this list):
- Using a pre-compiled Pattern (Performance of StringTokenizer class vs. split method in Java)
- Using Guava's Splitter (Java split String performances)
- Optimizing String storage (http://java-performance.info/string-packing-converting-characters-to-bytes/)
- Using optimized collections (http://blog.takipi.com/5-coding-hacks-to-reduce-gc-overhead)
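For reference, this is roughly what the first two variants looked like (a sketch; the field names are illustrative):

import java.util.regex.Pattern;
import com.google.common.base.Splitter;

// Variant 1: Pattern compiled once and reused for every line
private static final Pattern PIPE = Pattern.compile("\\|");
// ...
final String[] tokens = PIPE.split(iterator.nextLine());

// Variant 2: Guava's Splitter, also created once; it yields tokens
// lazily instead of building a full array up front
private static final Splitter SPLITTER = Splitter.on('|');
// ...
for (final String token : SPLITTER.split(iterator.nextLine())) {
    // process each token
}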
But none of them appeared to be efficient enough. Using JProfiler, I could see that the amount of memory used by temporary objects was too high (35 MB used, but only 15 MB actually held by live objects).
Then I decided to run a simple test: after every 50,000 lines read, call System.gc() explicitly. At the end of the process, memory usage had decreased from 35 MB to 16 MB. I tested this many, many times and always got the same result.
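The test looked roughly like this (a sketch; the counter is just a local variable):

long lineCount = 0;
while (iterator.hasNext()) {
    final String[] tokens = iterator.nextLine().split("\\|");
    // process the tokens
    if (++lineCount % 50000 == 0) {
        System.gc(); // explicit collection every 50,000 lines
    }
}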
I know that invoking System.gc() is bad practice (as indicated in Why is it bad practice to call System.gc()?). But is there any other alternative in a scenario where the split() method could be invoked millions of times?
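For example, would hand-rolling the tokenization with indexOf(), so that only the tokens I actually keep are ever materialized, be a reasonable direction? A rough sketch of what I mean (shouldKeep and store are hypothetical placeholders for my validation and storage logic):

// Walk the line with indexOf() and call substring() only for the
// tokens that pass the filter, skipping the rest entirely.
int start = 0;
while (start <= line.length()) {
    int end = line.indexOf('|', start);
    if (end == -1) {
        end = line.length();
    }
    if (shouldKeep(line, start, end)) {    // hypothetical predicate
        store(line.substring(start, end)); // allocate only kept tokens
    }
    start = end + 1;
}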
[UPDATE] I used a 5 MB file only for testing purposes, but the system should process much larger files (500 MB ~ 1 GB).