2

If I understand correctly, Java adds some overhead per object. If I want to build typical data structures such as linked lists, trees, tries, etc., the individual (list) items will be objects and therefore carry significant overhead compared to similar data structures in C. This becomes especially problematic for very large data sets. Is there some better way to implement these kinds of data structures in Java so that I won't have the per-object memory overhead?

The memory consumption of Java objects is described here. If I have millions of objects, the per-object overhead might become too expensive, so I was wondering whether there are better ways to approach such a situation.
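For illustration (a toy example, not a benchmark, and the numbers in the comments are typical 64-bit JVM figures, not measurements): a million ints stored as primitives take roughly 4 MB, while the same values in a LinkedList<Integer> pay for a node object plus a boxed Integer per element:

```java
import java.util.LinkedList;

public class OverheadSketch {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Primitive array: ~4 bytes per element, values only.
        int[] packed = new int[n];
        for (int i = 0; i < n; i++) packed[i] = i;

        // LinkedList<Integer>: per element, a Node (header + 3 references)
        // plus a boxed Integer (header + int field) -- typically 40+ bytes
        // each on a 64-bit JVM.
        LinkedList<Integer> boxed = new LinkedList<>();
        for (int i = 0; i < n; i++) boxed.add(i);

        System.out.println(packed[n - 1] == boxed.getLast()); // prints: true
    }
}
```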

  • In Java collections, items are objects of the same type – Jay Smith May 23 '17 at 05:28
  • 1
    Java has a very well developed group of classes and interfaces for that... see java.util.Collection. As you are aware, nothing will happen at a low level like in C, since the virtual machine is ALWAYS in the middle – ΦXocę 웃 Пepeúpa ツ May 23 '17 at 05:28
  • 1
    define *significant overhead* – Scary Wombat May 23 '17 at 05:30
  • Are you asking if there is a collection in Java that is implemented in C/C++/other-low-level-lang by standard (all VM implementation have it)? Or if there is a specific implementation you can use? – Liran Funaro May 23 '17 at 05:33
  • Define _very large data sets_. – Kevin Anderson May 23 '17 at 06:06
  • Considering the size, I'm thinking about 10s of GBs (of raw data) and data structures that are not practical for database storage. – Sebastian Klenk May 23 '17 at 07:03
  • 2
    *"the individual (list) items will be classes and therefore create a significant overhead as opposed to similar data structures in C"* ... the class loading and object instantiating mechanisms in Java have been continuously developed for decades at this point and you can count on them to be fast and efficient. These facilities and the garbage collector will outperform most self-rolled object pools, for example (except for very expensive objects like database connections). You choose Java precisely because you want the benefits of object abstraction, otherwise you develop at the level of C. – scottb May 23 '17 at 07:23
  • I don't think you want tens of GB all in memory at once. Or if you do, you have quite specialist needs that you'll need to ask about specifically. There is probably a database that suits your need (not all databases are relational), and if not, again, your need is very specialist and you need to write the database you need! – slim May 23 '17 at 10:31
  • On reflection, I have voted to close as "too broad". There are too many potential answers. I suggest editing or asking again, including code for a data structure you consider to be wasteful of space, showing what overheads you consider unacceptable, and asking for a better way. – slim May 23 '17 at 10:45
  • PS the "right way" to approach it would be, as with performance optimisation, to write it the "dumb" way, measure, then target the measured problem areas. No point optimising seldom-used routines for speed. No point optimising seldom-used data structures for size. – slim May 23 '17 at 10:48

4 Answers

1

You can implement these collections over chunks of bytes, obtained as new byte[...], ByteBuffer.allocate[Direct](...), or unsafe.allocateMemory(...). You then manage this memory manually: pack/unpack your objects to and from byte chunks along with any additional data (like the indices of the left and right children for a binary tree, the index of the next element for a linked list, etc.). This way you don't have to spend memory on object headers, extra references, or alignment (although you might decide you need to introduce your own alignment); you can keep your objects off-heap; you can have them mapped to a filesystem for persistence; and so on.

However, it's not simple and comes with subtleties: you might start depending on the malloc implementation and lose JVM heap optimizations; you lose memory model guarantees; your objects might be split between cache lines; you lose the benefits of GC compaction; etc. I'm not saying any of these is a show-stopper, just that it's not all roses and you should understand exactly what you are gaining.

If you have millions of objects, the overhead is likely in the hundreds of megabytes. Make sure it's worth trying to save that, compared to how much the necessary data itself takes and to how large your heap is.
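As a minimal sketch of this technique (the class and its layout are an illustration, not from the answer; removal and a free list are omitted to keep it short): a linked list whose nodes live in a single int[], with "next" pointers stored as array indices instead of object references, so there are no per-node objects at all.

```java
import java.util.Arrays;

/** Append-only sketch of a singly linked int list packed into one int[].
 *  Each logical node occupies two slots: [value, nextIndex]; -1 means null.
 *  No per-node objects, so no object-header cost per element. */
final class PackedIntList {
    private int[] cells = new int[2 * 16]; // value/next pairs
    private int head = -1;                 // index of the first node
    private int size = 0;

    void addFirst(int value) {
        int node = 2 * size;               // next free pair (append-only)
        if (node == cells.length) cells = Arrays.copyOf(cells, cells.length * 2);
        cells[node] = value;
        cells[node + 1] = head;            // link to the old head by index
        head = node;
        size++;
    }

    int get(int i) {                       // O(i) walk, as in LinkedList
        int node = head;
        while (i-- > 0) node = cells[node + 1];
        return cells[node];
    }

    int size() { return size; }

    public static void main(String[] args) {
        PackedIntList list = new PackedIntList();
        for (int v = 5; v >= 1; v--) list.addFirst(v);
        for (int i = 0; i < list.size(); i++) System.out.print(list.get(i) + " ");
        // prints: 1 2 3 4 5
    }
}
```

The same layout can be moved off-heap by replacing the int[] with a direct ByteBuffer and index arithmetic; the index-based linking stays identical.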

starikoff
  • 1,601
  • 19
  • 23
0

You can always use C++ native code from Java (via JNI) to increase performance and the level of control (I don't think you really need this, and I'm not sure that you can surpass standard Java code).

user3811082
  • 218
  • 1
  • 7
0

A quick Google search on "c++ library jni" turned up an article entitled Wrapping a C++ library with JNI – introduction that might prove interesting. I didn't read it, so I make no recommendation or guarantee as to its contents.

Kevin Anderson
  • 4,568
  • 3
  • 13
  • 21
0

If you have datasets where Java's per-object size overhead is a practical issue, I'd suggest considering a database. You can start with an in-memory, embedded database like SQLite, H2, or Redis.

As your data gets larger, you'll need more complex management. Updating cross references, indexes, and the like manually to ensure that your data can be efficiently queried is a huge effort that a database can help with.

Using a proper database also allows you to grow the data size further without major rewrites: when your data reaches the hundreds-of-gigabytes level where it no longer fits in memory and you have to transition to actually using the disk, or even the multi-terabyte level where you have to use multiple machines to hold it.

A proper database can grow together with your application, a bunch of objects in memory can't.

Lie Ryan
  • 62,238
  • 13
  • 100
  • 144
  • OP claims "data structures that are not practical for database storage" -- it's almost certainly not true, but should probably be addressed in this answer. – slim May 23 '17 at 11:00