I have the code below, which first reads a file and stores its contents in a HashMap (indexCategoryVectors). The HashMap maps a String (key) to a Long (value). The code uses the Long value as an offset to seek to a specific position in another file with RandomAccessFile.

Using the information read from this second file, plus some manipulation, the code writes new information to another file (filename4). The only variable that accumulates information is the buffer (var buffer = new ArrayBuffer[Map[Int, Double]]()), but the buffer is cleared after each iteration (buffer.clear).

The foreach should run more than 4 million times, and I'm noticing that memory accumulates. I tested the code with one million iterations and it used more than 32GB of memory. I don't know the reason; maybe it's related to garbage collection or something else in the JVM. Does anybody know what I can do to prevent this memory leak?

import java.io.RandomAccessFile
import scala.collection.mutable.ArrayBuffer

def main(args: Array[String]): Unit = {
  val indexCategoryVectors = getIndexCategoryVectors("filename1")
  val uriCategories = getMappingURICategories("filename2")

  val raf = new RandomAccessFile("filename3", "r")
  var buffer = new ArrayBuffer[Map[Int, Double]]()
  // Iterate over each entry of the URI -> categories map.
  uriCategories.foreach(uri => {
    var emptyInterpretation = true
    uri._2.foreach(categoria => {
      val position = indexCategoryVectors.get(categoria)
      // Seek to the stored offset of this category's vector.
      raf.seek(position.get)
      var vectorSpace = parserVector(raf.readLine)
      buffer += vectorSpace
      // Write the contents of the buffer to the output file.
      writeInformation("filename4")
      buffer.clear
    })
  })
  println("Success!")
}
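
The access pattern described above (a category-to-offset map, a seek per category, one output line per lookup) can be sketched in a self-contained form. This is only an illustration under stated assumptions, not the original program: the tab-separated line format, the object name `OffsetLookupSketch`, and the helpers `parseVector`, `buildIndex`, and `process` are all hypothetical stand-ins for the real `parserVector`, `getIndexCategoryVectors`, and `writeInformation`. It keeps a single writer open for the whole run and closes both files in `finally` blocks.

```scala
import java.io.{BufferedWriter, File, FileWriter, RandomAccessFile}
import scala.collection.mutable

object OffsetLookupSketch {
  // Assumed (hypothetical) line format: "category<TAB>index:weight index:weight ..."
  def parseVector(line: String): Map[Int, Double] = {
    val fields = line.split('\t')
    if (fields.length < 2) Map.empty
    else fields(1).split(' ').filter(_.nonEmpty).map { pair =>
      val Array(i, w) = pair.split(':')
      i.toInt -> w.toDouble
    }.toMap
  }

  // Scans the vector file once and records the byte offset where each line starts.
  def buildIndex(vectors: File): Map[String, Long] = {
    val raf = new RandomAccessFile(vectors, "r")
    try {
      val index = mutable.Map.empty[String, Long]
      var offset = raf.getFilePointer
      var line = raf.readLine() // readLine maps each byte to one char, fine for ASCII data
      while (line != null) {
        index(line.takeWhile(_ != '\t')) = offset
        offset = raf.getFilePointer
        line = raf.readLine()
      }
      index.toMap
    } finally raf.close()
  }

  // For every (uri, category) pair with a known offset, seeks, parses one line,
  // and writes one output line. Returns the number of lines written.
  def process(vectors: File, uriCategories: Map[String, Seq[String]], out: File): Int = {
    val index = buildIndex(vectors)
    val raf = new RandomAccessFile(vectors, "r")
    try {
      val writer = new BufferedWriter(new FileWriter(out))
      try {
        var written = 0
        uriCategories.foreach { case (uri, categories) =>
          categories.foreach { category =>
            // Unknown categories are skipped instead of calling .get on a None.
            index.get(category).foreach { position =>
              raf.seek(position)
              val vector = parseVector(raf.readLine())
              writer.write(s"$uri\t$category\t${vector.size}")
              writer.newLine()
              written += 1
            }
          }
        }
        written
      } finally writer.close()
    } finally raf.close()
  }
}
```

In this sketch nothing is retained between iterations except the offset index, so the parsed vectors become garbage as soon as their output line is written.
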
Marcelo Machado
  • How much memory is the JVM being allowed and how are you determining how much it is using? – ThisIsNoZaku Feb 21 '17 at 04:12
  • @ThisIsNoZaku I'm allocating 32gb with -Xmx32g in java_opts. To know how much it's using, I use the command free in linux terminal since "the only thing running" is that code. – Marcelo Machado Feb 21 '17 at 04:44
  • If you're not getting an `OutOfMemoryError` you likely don't have a leak, the JVM is simply using everything it's given. Try a JVM memory profiler if you want to see what's going on in there, the Oracle JDK has VisualVM packaged with it. – ThisIsNoZaku Feb 21 '17 at 05:01
  • If you are giving the JVM 32gb and you are seeing it use 32gb, that is ok. As @ThisIsNoZaku told you, try to use VisualVM to understand how your code is using the allocated memory. Also, see [this discussion](http://stackoverflow.com/q/21272877/4600). Anyway, your code also has some possible leaks: if an exception occurs in the `while` loop, for instance, your code won't close the input streams. See [this discussion](http://stackoverflow.com/q/2207425/4600) about resource management in Scala. – marcospereira Feb 21 '17 at 05:14
  • Maybe you didn't intend to allocate `buffer` each time. `buffer.clear` suggests you wanted to reuse it. That clears references to array elements, but keeps the underlying array. – som-snytt Feb 21 '17 at 05:42
  • @ThisIsNoZaku I didn't get an OutOfMemoryError because I stopped the program; it was using swap memory. As I said, it used more than 32gb. I will try to use VisualVM. – Marcelo Machado Feb 21 '17 at 06:32
  • @marcospereira I read the first discussion and now understand better how heap memory works. I made a mistake trying to allocate 32GB for the heap, because my total memory is exactly 32gb. But my process is using more than that, it's using swap memory, so I think the problem is bigger than that. I will try to use VisualVM, but I can't imagine why this code uses so much memory. – Marcelo Machado Feb 21 '17 at 06:43
  • @som-snytt Yes, I already changed that, but the problem persists – Marcelo Machado Feb 21 '17 at 06:46
  • This sounds unusual: if it is using more memory than what was given to the JVM (using `-Xmx`), then you should be getting an `OutOfMemoryError`. Of course, if your system does not have 32gb available and you try to allocate this amount of memory to the JVM, the OS will do swapping, but this has nothing to do with your code. Anyway, I think that what you want is a stream processing framework, like [Akka Streams](http://doc.akka.io/docs/akka/2.4.17/scala/stream/stream-introduction.html#motivation), since you will process a large amount of data and don't need it all to be in memory. – marcospereira Feb 21 '17 at 06:49
  • Take a look [here](http://doc.akka.io/docs/akka/2.4.17/scala/stream/stream-io.html#Streaming_File_IO) and [here](http://engineering.intenthq.com/2015/06/wikidata-akka-streams/). – marcospereira Feb 21 '17 at 06:54
  • @marcospereira Yes, perhaps it's swapping because I set the heap to 32GB. I will change it and see what happens. It is very nice to know about Akka Streams; this seems to be exactly what I need. I am using Wikidata too. I will try to use that to solve my problems. – Marcelo Machado Feb 21 '17 at 07:08
  • I think the question is good and interesting, but perhaps the code should be shortened a bit. Can you provide just some code demonstrating the idea, instead of the real production code? – Suma Feb 21 '17 at 07:39
  • @Suma done, thank you! – Marcelo Machado Feb 21 '17 at 08:10
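
One way to follow up on the comments about measuring memory is to log the JVM's own heap figures from inside the loop, since `free` reports system-wide numbers that may include memory other than the Java heap. The sketch below is a minimal illustration, not a replacement for a profiler such as VisualVM; the names `HeapUsageSketch`, `HeapSnapshot`, `snapshot`, and `foreachWithHeapLog` are hypothetical. It uses only `Runtime.totalMemory`, `Runtime.freeMemory`, and `Runtime.maxMemory`.

```scala
object HeapUsageSketch {
  private val MiB = 1024L * 1024L

  // used = heap currently occupied, committed = heap reserved by the JVM so far,
  // max = upper bound the JVM will try to use (governed by -Xmx).
  final case class HeapSnapshot(usedMiB: Long, committedMiB: Long, maxMiB: Long)

  def snapshot(): HeapSnapshot = {
    val rt = Runtime.getRuntime
    val committed = rt.totalMemory()
    HeapSnapshot((committed - rt.freeMemory()) / MiB, committed / MiB, rt.maxMemory() / MiB)
  }

  // Runs `step` on every item and prints a heap snapshot every `every` items.
  // Returns the number of items processed.
  def foreachWithHeapLog[A](items: Iterator[A], every: Int)(step: A => Unit): Long = {
    require(every > 0, "every must be positive")
    var count = 0L
    items.foreach { item =>
      step(item)
      count += 1
      if (count % every == 0) println(s"$count items: ${snapshot()}")
    }
    count
  }
}
```

For example, `HeapUsageSketch.foreachWithHeapLog(uriCategories.iterator, 100000) { uri => ... }` would print a snapshot every 100,000 entries. If `usedMiB` keeps dropping back after collections, the heap itself is probably not leaking; if it climbs steadily toward `maxMiB`, something is being retained.
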

0 Answers