I noticed that the following code uses multiple threads and keeps all CPU cores at about 100% while it is reading the file.

scala.io.Source.fromFile("huge_file.txt").toList

and I assume the following is the same

scala.io.Source.fromFile("huge_file.txt").foreach

I interrupted this code, run as a unit test under the Eclipse debugger on my dev machine (OS X 10.9.2), and it showed these threads: main, ReaderThread, and 3 Daemon System Threads. When I run it in a Scala console on a 24-core server machine (Ubuntu 12), htop shows all cores are busy.

Questions:

  1. How do I limit this code to using N threads?
  2. For the sake of understanding the system performance, can you explain what io.Source is doing here, why, and how? Reading the source doesn't help.
  3. I assume each line is read in sequence; however, since multiple threads are in use, is the foreach body run on multiple threads? My debugger seems to tell me that the code still runs in the main thread.

Any insight would be appreciated.

yuushi
  • Are you sure you're not seeing garbage collector activity on all threads? – Rex Kerr May 13 '14 at 20:44
  • I don't think so as all 24 cores are close to 100% on making a list. Temporary object cleaning shouldn't create such heavy load I believe. – yuushi May 13 '14 at 20:51
  • Maybe you should make sure with `-XX:+UseSerialGC`? – Rex Kerr May 13 '14 at 21:01
  • When you call `toList`, you are forcing this really big file into memory which is never a good idea and probably the cause of all the thrashing you are seeing on your computer. I can't see any realistic use case where reading a really big file entirely into memory is a good idea. That's why you start with an `Iterator` and have to make an explicit call to force it into a List – cmbaxter May 13 '14 at 21:38
  • Not quite a dupe, but people ask this often http://stackoverflow.com/q/23007646/1296806 – som-snytt May 13 '14 at 21:59
  • This is code I excerpted from my main code, which uses an iterator to do some sequential computation. However, this line of code makes it multi-threaded. While searching for how Scala parallelizes my code, I found this line to be the cause. – yuushi May 13 '14 at 22:00
  • Thanks to Rex Kerr: after I use the -XX:+UseSerialGC option, it uses only 2 threads, i.e. scala.io.Source.fromFile() must generate a lot of temporary objects, keeping the GC very busy and using all threads to clean them up. This is a surprise. – yuushi May 14 '14 at 20:39
  • It also depends on your memory settings, etc. Are you growing heap too? You could answer your own question and add some details like that. – som-snytt May 15 '14 at 12:31

1 Answer

As suggested, I put my findings here.

I used the following to test my dummy code with and without the -J-XX:+UseSerialGC option:

$ scala -J-XX:+UseSerialGC
scala> var c = 0
scala> scala.io.Source.fromFile("huge_file.txt").foreach(e => c += e)

Without the option, all 24 cores on my server machine are busy during the file read. With the option, only two threads are busy.
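
To double-check that the extra busy threads are not running the loop body itself, one can record which threads execute the function passed to foreach. This is only a minimal sketch: the object name ForeachThreadCheck and the helper threadsUsed are hypothetical, and an in-memory Source.fromString is used as a stand-in for the file so that the snippet is self-contained.

```scala
import scala.io.Source

object ForeachThreadCheck {
  // Returns the names of all threads that executed the foreach body.
  def threadsUsed(src: Source): Set[String] = {
    val names = scala.collection.mutable.Set.empty[String]
    src.foreach(_ => names += Thread.currentThread().getName)
    names.toSet
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for Source.fromFile("huge_file.txt") (assumption for the demo).
    val src = Source.fromString("stand-in for huge_file.txt\n" * 1000)
    println(threadsUsed(src))
  }
}
```

If the returned set contains a single name, the body ran only on the calling thread, which would match what the debugger showed.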


Here is the memory profile I captured on my dev machine (not the server). I first forced a GC to get a baseline, then ran the above code several times. The Eden space gets cleaned up periodically. Memory usage swings by about 20 MB, while the smaller file I read is about 200 MB, i.e. the temporary objects io.Source creates between collections amount to roughly 10% of the file size on each run.

[Image: memory profile]
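
The GC activity during a read can also be measured from inside the JVM through the standard java.lang.management API, instead of being read off a profiler graph. The following is a sketch under assumptions: the names GcDuringRead, gcTotals and measure are hypothetical, and an in-memory source again stands in for the file.

```scala
import java.lang.management.ManagementFactory
import scala.io.Source

object GcDuringRead {
  // Sums collection counts and times (ms) over all collectors of this JVM.
  // Collectors that report -1 (undefined) are skipped.
  def gcTotals(): (Long, Long) = {
    var count = 0L
    var timeMs = 0L
    val it = ManagementFactory.getGarbageCollectorMXBeans().iterator()
    while (it.hasNext) {
      val bean = it.next()
      if (bean.getCollectionCount() >= 0) count += bean.getCollectionCount()
      if (bean.getCollectionTime() >= 0) timeMs += bean.getCollectionTime()
    }
    (count, timeMs)
  }

  // Runs `body` and returns (result, collections during body, GC ms during body).
  def measure[A](body: => A): (A, Long, Long) = {
    val (c0, t0) = gcTotals()
    val result = body
    val (c1, t1) = gcTotals()
    (result, c1 - c0, t1 - t0)
  }

  def main(args: Array[String]): Unit = {
    val (sum, collections, gcMs) = measure {
      var c = 0L
      // Stand-in for Source.fromFile("huge_file.txt") (assumption for the demo).
      Source.fromString("x" * 1000000).foreach(e => c += e)
      c
    }
    println(s"sum=$sum collections=$collections gcMillis=$gcMs")
  }
}
```

Running this with and without -J-XX:+UseSerialGC should show how many collections the read triggers under each collector; the numbers depend on heap settings and file size.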

This characteristic causes trouble on a shared system. It also limits our ability to handle multiple big files at once. It stresses memory, I/O and CPU to the point that I can't run my code alongside other production jobs and have to run it separately to avoid the system-wide impact.

If you know a better way to handle this situation in a real shared production environment, please let me know.
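
One direction that may be worth trying, assuming the per-character work can be done on a reusable buffer, is to read through a plain java.io.Reader into a fixed-size char array instead of iterating a Source. This is only a sketch of that idea, not a measured fix: ChunkedRead and sumChars are hypothetical names, the buffer size of 8192 is an arbitrary choice, and a StringReader stands in for the file.

```scala
import java.io.{Reader, StringReader}

object ChunkedRead {
  // Sums the numeric values of all chars, reusing a single fixed-size buffer.
  // The reader is closed when the loop finishes or fails.
  def sumChars(reader: Reader, bufSize: Int = 8192): Long = {
    val buf = new Array[Char](bufSize)
    var sum = 0L
    try {
      var n = reader.read(buf)
      while (n != -1) {
        var i = 0
        while (i < n) { sum += buf(i); i += 1 }
        n = reader.read(buf)
      }
    } finally reader.close()
    sum
  }

  def main(args: Array[String]): Unit = {
    // For a real file, a java.io.BufferedReader over a java.io.FileReader
    // could be passed instead of this in-memory stand-in.
    println(sumChars(new StringReader("abc")))
  }
}
```

Whether this actually lowers GC pressure for a given workload would have to be confirmed by measurement under the same JVM options.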

yuushi