2

i was testing ruby, python and scala to see which one has better support for huge arrays.

Ruby (each) -> 84 MB, 2s # a = [];(0...1e7).each{|i| a[i]=i}
Ruby (expand) -> 254 MB, 60s #a = [*(0...1e7)]
Python (numpy) -> 95 MB, < 1s # a = np.arange(0, 1e7)
Python (range) -> 391 MB, < 1s # a = list(range(0, int(1e7)))
Scala (range) -> Memory Limit Exceeded :-) # var a = Range(0, 1e7.toInt).map(_.toInt)

as you can see, memory limit exceeds when using scala, what's the problem?!!!

--UPDATE

ok i should've used

var a = Array.range(0, 1e7.toInt)

but, what if i want to map the array to strings?

var b = a.map(_.toString())

Failure here...

aliep
  • 1,702
  • 2
  • 21
  • 33
  • I tried your Scala line and it worked fine, and took less than 1 second to run (Scala 2.10.1, 64-bit Java 7 JVM). – Jesper Apr 25 '13 at 08:54
  • @Jesper what is your memory settings (e.g. `-Xmx`)? – om-nom-nom Apr 25 '13 at 08:55
  • @om-nom-nom Just using default settings. Note that the 64-bit JVM by default takes 1/4 of available system RAM as the max. heap size (on Windows, at least). Just checked with VisualVM, the increase in memory usage seems to be about 200 MB. – Jesper Apr 25 '13 at 08:57
  • yeah the code is correct, and it worked well for the first time, but since when i closed scala and reopened it, i failed to rerun the code – aliep Apr 25 '13 at 08:58
  • @Jesper I'm asking because I've default settings too (and 8G of RAM, on 64bit OSX), but my range is failing just as op's. – om-nom-nom Apr 25 '13 at 08:58
  • this one, "var a = Array.range(0, 1e7.toInt)" works just fine, thanks** – aliep Apr 25 '13 at 09:00
  • the problem is with map var a = Array.range(0, 1e7.toInt) a.map(_.toString()) //FAIL – aliep Apr 25 '13 at 09:03
  • That's not a direct answer on your question, but on the last NEScala there was a great talk on such things given by Avi Bryant http://www.youtube.com/watch?v=yzitqjUI6ok. If you really need work with big data in you project try Algebird from Twitter – 4lex1v Apr 25 '13 at 09:11

3 Answers3

4

Scala's Range isn't identical to array in a pure sense (inside it doing some additional work besides just keeping N items, so it is less memory efficient and eats memory faster).

Array.range(0, 1e7.toInt) will be closer to python's np.arange and will work fine.

om-nom-nom
  • 62,329
  • 13
  • 183
  • 228
  • 2
    `Range` isn't an array _at all_ and it doesn't keep N items. It just stores the endpoints, and if you map out of it it'll expand into a collection (which happens to be a `Vector`). – Rex Kerr Apr 25 '13 at 11:05
3

You want ten million strings of typical length 7 (and you have the ten million integers also), and on the JVM strings are two bytes per character (plus overhead), so you are certain to need at least 180 MB to hold this. Let's find out how much worse it really is:

$ scala -J-Xmx896M -e 'println(Array.range(0,1e7.toInt).map(_.toString).apply(9999999))'
9999999

$ scala -J-Xmx832M -e 'println(Array.range(0,1e7.toInt).map(_.toString).apply(9999999))'
java.lang.OutOfMemoryError: Java heap space

So there you go; about 4.5x worse than you'd predict. The JVM isn't very good at this sort of thing: each string is actually two objects (the string and the character array), and an object has about 12 bytes of overhead plus contents, and there's rounding up to nice boundaries. And the JVM itself needs some space to run (and for all the classes it loads especially with Scala including the interpreter). So this is "reasonable".

But you're not testing support for huge arrays here; you're testing how memory-efficient Java's String is, and it's not very.

Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
  • 1
    To further illustrate @Rex's point, translating the numbers into a concatenated string (via StringBuilder) requires less than half the memory as an array of Strings: `scala -J-Xmx390M -e 'Array.range(0, 1e7.toInt).foldLeft(new StringBuilder) { _ append _.toString }'` Foregoing the array drops us further: `scala -J-Xmx330M -e '(0 to 1e7.toInt).foldLeft(new StringBuilder) { _ append _.toString }'`. With the now deprecated '-XX:+UseCompressedStrings', the snippet could be made to run with 130M. See [here](http://stackoverflow.com/q/8833385/217959) for why it was deprecated. – nadavwr Apr 25 '13 at 22:15
1

Your Scala code:

var a = Range(0, 1e7.toInt).map(_.toInt)

will throw a "java.lang.OutOfMemoryError" exception if the environment is out of memory -- this works as it should.

The problem you may be having is that your Scala code doesn't create a huge array, it creates a huge IndexedSeq. In Scala an IndexedSeq is like a B-tree. Change your Scala code to the following if you want to compare arrays and you might get a better result.

val a = (0 to 1e7.toInt).toArray[Int]
Keith Pinson
  • 1,725
  • 1
  • 14
  • 17