
I have a query that returns around 6 million rows, which is too big to process all at once in memory.

Each row comes back as a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters of UTF-8.

How can I work out the maximum size of one of these tuples and, more generally, how can I approximate the size of a Scala data structure like this?

I've got 6 GB on the machine I'm using. However, the data is being read from the database using ScalaQuery into Scala Lists.

Squidly

2 Answers


Scala objects follow approximately the same rules as Java objects, so any information on Java object sizes applies. Here is one source, which seems at least mostly right for 32-bit JVMs. (64-bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes of extra overhead plus 4 extra bytes per pointer, but there may be less if the JVM is using compressed pointers, which it does by default now, I think.)

I'll assume a 64-bit machine without compressed pointers (the worst case). A Tuple3 then has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded to a multiple of 8, which gives 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use wrapped versions.) A String is 32 bytes, IIRC, plus the array for the data, which is 16 bytes plus 2 per character. java.sql.Timestamp needs to store a couple of Longs (I think), so that's 32 bytes. All told, it's on the order of 120 bytes plus 2 per character, which at ~20 characters is ~160 bytes.
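To make the arithmetic easy to check, here is a minimal sketch that just adds up the figures quoted above. The object and field names are made up for illustration, and the constants are the estimates from this answer (64-bit, no compressed pointers), not values read from a real JVM.

```scala
// Adds up the per-object estimates from the paragraph above.
// All constants are this answer's estimates, not measured values.
object TupleSizeEstimate {
  val tupleBytes = 32        // two pointers + Int + object overhead, rounded
  val intStubBytes = 8       // extra stub object for the Int
  val stringBytes = 32       // the String object itself
  val charArrayHeader = 16   // array object overhead plus length
  val bytesPerChar = 2       // UTF-16 chars in the backing array
  val timestampBytes = 32    // java.sql.Timestamp

  // Estimated bytes for one (String, Int, Timestamp) with a string of `chars` characters.
  def estimate(chars: Int): Int =
    tupleBytes + intStubBytes + stringBytes +
      charArrayHeader + bytesPerChar * chars + timestampBytes

  def main(args: Array[String]): Unit =
    println(s"20-character string: ${estimate(20)} bytes per tuple")
}
```

The List cells that hold the tuples are not counted here, so the real footprint of a List of these will be somewhat higher.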

Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above has been corrected using this data so it matches; I had several small errors before).
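One rough way to measure, which may or may not be what the linked answer does, is to allocate many instances and compare used heap before and after. The sketch below assumes that approach; `MeasureTupleSize` and `bytesPerElement` are hypothetical names, and `System.gc()` is only a hint to the JVM, so treat the result as approximate.

```scala
// Rough heap-delta measurement: allocate n objects, compare used heap
// before and after. Approximate only, since System.gc() is just a hint.
object MeasureTupleSize {
  private def usedHeap(): Long = {
    val rt = Runtime.getRuntime
    System.gc(); System.gc()
    rt.totalMemory - rt.freeMemory
  }

  // Average bytes retained per object created by `make`.
  def bytesPerElement[A <: AnyRef](n: Int)(make: Int => A): Double = {
    val holder = new Array[AnyRef](n) // allocated first so it is not counted
    val before = usedHeap()
    var i = 0
    while (i < n) { holder(i) = make(i); i += 1 }
    val after = usedHeap()
    require(holder(n - 1) != null)    // keep the objects reachable until here
    (after - before).toDouble / n
  }

  def main(args: Array[String]): Unit = {
    // A distinct 20-character string per tuple, as in the question.
    val perTuple = bytesPerElement(200000) { i =>
      ("%020d".format(i), i, new java.sql.Timestamp(i.toLong))
    }
    println(f"about $perTuple%.0f bytes per tuple")
  }
}
```

The number depends on the JVM and its settings. With compressed pointers, or on newer JVMs that store Latin-1 strings at one byte per character, it will come out lower than the worst-case figure above.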

Rex Kerr
  • Good point, I forgot about the extra overhead in the String plus object overhead. Still, it's not very much data. – oxbow_lakes Jun 26 '12 at 14:28
  • Why 24 plus 2 per character on the String array? IIRC, an Array is 8 bytes vs 4 bytes for a non-array, plus the elements. – Daniel C. Sobral Jun 26 '12 at 16:28
  • @DanielC.Sobral - There's object overhead plus length, which is 16 bytes on a 64 bit machine, so I was off by a bit. – Rex Kerr Jun 26 '12 at 16:50

How much memory have you got at your disposal? 6 million instances of a triple is really not very much!

Each reference takes either 4 or 8 bytes, depending on whether you are running a 32- or 64-bit JVM. A 64-bit JVM needs 8 bytes only when compressed "oops" are off, and compressed oops are the default in JDK 7 for heaps under 32 GB.
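If you want to check which case applies to your JVM, something like the following should work on HotSpot. It is a sketch under that assumption: the `sun.arch.data.model` property and the `UseCompressedOops` option are HotSpot-specific, so every lookup is wrapped and returns None elsewhere. `ReferenceSize` is a name made up for this example.

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean
import scala.util.Try

// HotSpot-specific sketch: guess the size of an object reference.
object ReferenceSize {
  // 32 or 64 on HotSpot; None if the property is missing.
  def dataModelBits: Option[Int] =
    Option(System.getProperty("sun.arch.data.model"))
      .flatMap(s => Try(s.toInt).toOption)

  // Whether compressed oops are enabled; None if it cannot be determined.
  def compressedOops: Option[Boolean] = Try {
    val bean = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    bean.getVMOption("UseCompressedOops").getValue.toBoolean
  }.toOption

  // 4 bytes on 32-bit or with compressed oops, otherwise 8.
  def referenceBytes: Option[Int] = dataModelBits.map {
    case 32 => 4
    case _  => if (compressedOops.getOrElse(false)) 4 else 8
  }

  def main(args: Array[String]): Unit =
    println(s"reference size: ${referenceBytes.getOrElse("unknown")} bytes")
}
```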

So your triple has 3 references (there may be extra ones due to specialisation, so you might get 4 refs). Your Timestamp is a wrapper (reference) around a long (8 bytes). Your Int will be specialized (i.e. an underlying int), so that is another 4 bytes. The String is 20 x 2 bytes. So you basically have a worst case of well under 100 bytes per row; at 100 bytes that is 10 rows per KB and 10,000 rows per MB, so you can comfortably process your 6 million rows in under 1 GB of heap.
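As a quick sanity check on the totals, this sketch multiplies the row count by a per-row size. It uses the 100-byte ceiling from this answer and the ~160 bytes from the other answer; `HeapBudget` and its methods are illustrative names only, and container overhead (the List cells) is left out.

```scala
// Total heap for a given row count and per-row size (container overhead excluded).
object HeapBudget {
  def totalBytes(rows: Long, bytesPerRow: Long): Long = rows * bytesPerRow

  def toMiB(bytes: Long): Double = bytes.toDouble / (1024 * 1024)

  def main(args: Array[String]): Unit = {
    val rows = 6000000L
    for (perRow <- Seq(100L, 160L)) {
      val mib = toMiB(totalBytes(rows, perRow))
      println(f"$perRow%d bytes/row -> $mib%.1f MiB")
    }
  }
}
```

Both per-row figures keep 6 million rows under 1 GiB.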

Frankly, I think I've made a mistake here, because we comfortably process several million rows a day of about twenty fields each (including decimals, Strings, etc.) in this space.

oxbow_lakes