I have an external program that generates some data I need. Usually, I redirect its output to a file and then read that file from my Scala application, e.g.

app.exe > output.data

Now I want to run the program directly from my application, so I did

import scala.sys.process._
val stream = "app.exe".lineStream
stream foreach { line => doWork(line) }

Unfortunately, I got a GC overhead exception after a while. app.exe may generate very large output, e.g. over 100MB. So I think that during the streaming, Scala has been creating and destroying line string instances thousands of times, and that this causes the overhead.

I know I can tune the JVM options to relax the GC overhead limit, but I am looking for a way that doesn't need to create a lot of small line instances.

sjrd
David S.

1 Answer

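The problem is probably due to memoization: a Stream caches every element it has already evaluated, and because your stream is held in a val, the head of the stream stays reachable while foreach walks through it. Effectively, you are rooting the whole file in memory.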

See lots and lots of info on how to avoid this here: http://blog.dmitryleskov.com/programming/scala/stream-hygiene-i-avoiding-memory-leaks/

Specifically, you are violating rule #1. Try defining your stream as a def, not a val.
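
A minimal sketch of that suggestion, assuming Scala 2.11/2.12 (on 2.13 and later, lineStream is deprecated in favour of lazyLines, which behaves the same way for this purpose). The StreamLines object, the lines and processAll helpers, and the doWork body are hypothetical names used only for illustration:

```scala
import scala.sys.process._

object StreamLines {
  // Hypothetical placeholder for the doWork function from the question.
  def doWork(line: String): Unit = ()

  // A def rather than a val: every call starts the process anew and returns
  // a fresh stream, so no long-lived named value points at the stream's head.
  def lines(command: Seq[String]) = command.lineStream

  // Consume the output line by line. The stream is only a temporary here,
  // so cells that have already been processed can be garbage collected.
  def processAll(command: Seq[String])(work: String => Unit): Unit =
    lines(command).foreach(work)
}
```

With this in place, the call in the question would become something like StreamLines.processAll(Seq("app.exe"))(StreamLines.doWork). Note that every call to lines launches the external program again, so it should be called once per run rather than stored and reused.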

Chris Shain