
I work at a small company, where I build banking software. I now have to build a data structure like this:

Array [Int-Max] [2] // Large 2D array

I need to save it to disk and load it the next day for further work.

Since I only know Java (and a little bit of C), they keep insisting that I use C++ or C. Their reasoning is:

  1. They have seen that Array [Int-Max] [2] in Java takes nearly 1.5 times more memory than in C, and that C++ has a more reasonable memory footprint than Java.

  2. C and C++ can handle arbitrarily large files, whereas Java can't.

According to them, as the database/data structure becomes large, Java just becomes infeasible. Since we have to work on such a large database/data structure, C/C++ is always preferable.

Now my questions are:

  1. Why is C or C++ always preferable to Java for large databases/data structures? For C, maybe, but C++ is also an OOP language. So how does it gain an advantage over Java?

  2. Should I stay with Java, or will their suggestion (switching to C++) be helpful in the future in a large database/data-structure environment? Any suggestions?

Sorry, I have very little knowledge of all this and have just started to work on the project, so I am really confused. Until now I have only built some school projects and have no idea about relatively large projects.

Arpssss
  • What is the *type* of the array? If you use *primitives* I doubt the memory usage will be so significant. – amit Aug 22 '12 at 12:40
  • @amit, object of any type. Before going to do that, I just want to get idea, how much I can --. Because I have really short time :). – Arpssss Aug 22 '12 at 12:42
  • Also, you can profile your code with the expected array size and a stub algorithm before implementing the core, to see what the real difference is expected to be. (Assuming the array is indeed the expected main space consumer) – amit Aug 22 '12 at 12:42
  • I'm concerned that there may be more in terms of requirements at hand here than we (as readers) know about. That is, why is it essential that such a large 2D array be declared? Aren't there other implementations, e.g. sparse arrays, etc. that inherently wouldn't take as much mem? As far as "large" files go, define "large?" Random access files could, theoretically, be arbitrarily large, with the underlying file sys, then hardware, affecting performance as much as anything. Very broadly, interpreted Java bytecode would likely include a performance penalty over compiled C++... Lots of variables here. – David W Aug 22 '12 at 12:45
  • One more thing. There is another issue when allocating small objects in arrays in C++ vs. Java. In C++, you allocate an array of objects and they are contiguous in memory, while in Java the objects themselves aren't. In some cases this can give the C++ program much better performance, because it is much more cache efficient than the Java program. I once addressed this issue in [this thread](http://stackoverflow.com/q/9632288/572670) – amit Aug 22 '12 at 12:47
  • @DavidW, Large files means say > 8/9 GB. And we need it because for something like this http://stackoverflow.com/questions/11765517/java-custom-hash-map-table-some-points. – Arpssss Aug 22 '12 at 12:50
  • Let's not forget that in Java you need to specify the maximum heap size before the VM launches. – Lubo Antonov Aug 22 '12 at 12:52
  • @Arpssss, if you are indeed talking about a database, you would never design it to load everything in memory - usually your database will be larger than the amount of physical memory you have available. – Lubo Antonov Aug 22 '12 at 12:54
  • I don't get the *"They have seen `Array [Int-Max] [Int-Max]` in Java will take nearly 1.5 times more memory than C and C++ takes"* part... could you give an example of the file that's saved to the disk so we can see how that's possible? – user541686 Aug 22 '12 at 12:59
  • @Mehrdad, sorry you mean to upload the file somewhere to check ? – Arpssss Aug 22 '12 at 13:01
  • @Arpssss: Either that, or just put it into words in more detail... I'm just having trouble figuring out what the file would look like. For example, is the file in binary or ASCII? If it's in ASCII, is this what it looks like? Decimal or hexadecimal? Like this? `{123, 456, 789, 10} 4294967296 4294967296` And how big is each array, approximately? (2 elements? 10? 1000+?) – user541686 Aug 22 '12 at 13:05
  • @Mehrdad, OK. Just let me check, I am updating it in few minutes (size). – Arpssss Aug 22 '12 at 13:09
  • @Arpssss: What is the data type stored in the array, what are the dimensions? – David Rodríguez - dribeas Aug 22 '12 at 13:15
  • @Mehrdad, It has 3,195,529,558 bytes and contains nearly 700 millions of URL's. – Arpssss Aug 22 '12 at 13:15
  • @DavidRodríguez-dribeas, objects of generic type, dimensions are 2D of Int-Max. – Arpssss Aug 22 '12 at 13:16
  • @Arpssss: Oh, so the array isn't of integers! That's very useful to know... of course it can use more memory in Java then... if nothing else, `char` in Java is 2 bytes, so unless you explicitly save the URLs as UTF-8 (or unless you do the C++ version in UTF-16), they should take up 2x the memory than in C++... yeah, now it makes more sense. – user541686 Aug 22 '12 at 13:17
  • @Arpssss: Objects of generic type means that they must be `Object` (or derived from it, but ultimately reference types). If you were to store a single `int`, the overhead of `Integer` would be roughly 3 pointers to one `int` (i.e. about 4x the memory), and you will need to add the reference in the array for a factor of approximately 5x the memory for an `Integer` vs. an array of `int` in C/C++. Of course, for larger types the overhead will be relatively smaller. – David Rodríguez - dribeas Aug 22 '12 at 13:20

3 Answers


why C/C++ is always preferable on large database/data-structure over Java ? Because, C may be, but C++ is also an OOP. So, how it get advantage over Java ?

Remember that a Java array (of objects) [1] is actually an array of references. For simplicity, let's look at a 1D array:

java:

[ref1,ref2,ref3,...,refN]
ref1 -> object1
ref2 -> object2
...
refN -> objectN

c++:

[object1,object2,...,objectN]

The C++ version does not need the overhead of references in the array: the array holds the objects themselves, not just references to them. If the objects are small, this overhead can indeed be significant.

Also, as I already stated in the comments, there is another issue when allocating small objects in arrays in C++ vs. Java. In C++, you allocate an array of objects and they are contiguous in memory, while in Java the objects themselves aren't. In some cases this can give the C++ program much better performance, because it is much more cache efficient than the Java program. I once addressed this issue in [this thread](http://stackoverflow.com/q/9632288/572670).
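To make the layout difference concrete, here is a minimal sketch of how Java can get a contiguous, C++-like layout when each record can be encoded as primitives: all records live in one `int[]` instead of one heap object per record. The class `FlatPairs` and its method names are hypothetical and not part of the original discussion; it does not help if the elements must stay arbitrary objects.

```java
// Hypothetical sketch: N records of two ints stored contiguously in one int[].
// This avoids one object header and one reference per record.
final class FlatPairs {
    private final int[] data; // layout: [a0, b0, a1, b1, ...]

    FlatPairs(int count) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        this.data = new int[Math.multiplyExact(count, 2)];
    }

    void set(int index, int first, int second) {
        data[2 * index] = first;
        data[2 * index + 1] = second;
    }

    int first(int index)  { return data[2 * index]; }
    int second(int index) { return data[2 * index + 1]; }
    int size()            { return data.length / 2; }
}
```

Iterating over such a structure walks memory sequentially, which is the cache-friendliness point made above.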

2) Should I stay on Java or their suggestion (switch to C++) will be helpful in future on large database/data-structure environment ? Any suggestion ?

I don't believe we can answer it for you. You should be aware of all the pros and cons (memory efficiency, libraries you can use, development time, ...) of each for your purpose and make a decision. Don't be afraid to get advice from senior developers in your company, who have more information about the system than we do.
If there were a simple, generic answer to questions like this, we engineers wouldn't be needed, would we?

You can also profile your code with the expected array size and a stub algorithm before implementing the core, to see what the real difference is likely to be. (Assuming the array is indeed the expected main space consumer.)
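A stub for this kind of measurement could look like the sketch below. It is only a rough probe: `HeapProbe` is a hypothetical name, the row count is an arbitrary assumption, and `System.gc()` is merely a hint, so the figures it prints are approximate and vary between JVMs and runs.

```java
// Rough heap-usage probe; results are approximate because GC timing is not deterministic.
final class HeapProbe {
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint to the JVM
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        int rows = 1_000_000; // assumed test size; scale toward the real size as the heap allows

        long before = usedHeap();
        int[][] nested = new int[rows][2]; // array of references to many tiny arrays
        long afterNested = usedHeap();
        int[] flat = new int[rows * 2];    // one contiguous block
        long afterFlat = usedHeap();

        System.out.printf("nested: ~%d bytes/row%n", (afterNested - before) / rows);
        System.out.printf("flat:   ~%d bytes/row%n", (afterFlat - afterNested) / rows);

        // Use both arrays after measuring so they stay reachable until here.
        System.out.println(nested.length + flat.length);
    }
}
```

For anything beyond a first impression, a real profiler or a heap dump gives more trustworthy numbers than this probe.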


1: The overhead described above does not apply to arrays of primitives. In that case the arrays hold values rather than references, the same as in C++, with minor overhead for the array itself (a length field, for example).

amit
  • Thanks a lot. But, one thing why java do this: [ref1,ref2,ref3,...,refN] ref1 -> object1 ref2 -> object2 extra step ? Means they can do like C++. – Arpssss Aug 22 '12 at 13:06
  • In Java, arrays hold *references*, not the *objects* themselves (remember, after you allocate an array with `Object[] arr = new Object[N];`, the statement `arr[i] = new Object();` assigns a reference into the array). Why did the designers choose this solution for arrays? I'm afraid that is too complex a question to answer in comments. – amit Aug 22 '12 at 13:08
  • Uhm, it's worth clarifying that primitives in Java (like `int`) are value types, just like in C++, so whether that's an issue depends on the array type. – user541686 Aug 22 '12 at 13:08
  • @Mehrdad: Yeap, this is why I explicitly mentioned the arrays are of objects. The comments (to the question) indicates that the array is an array of *objects* – amit Aug 22 '12 at 13:08
  • @Mehrdad: Yes and no... a bidimensional array is really an array of arrays. Even if the inner type is an `int`, it would be an array of references to arrays of `int`. Of course it will be worse if the data type stored is not a primitive type, but even in the best scenario there is overhead. (I understand why you commented that, but I feel it is better to clear for others where it matters and where it doesn't) – David Rodríguez - dribeas Aug 22 '12 at 13:16
  • @DavidRodríguez-dribeas: Yeah, I know how arrays work in Java lol. :P I was talking about primitive arrays, but for 2D, the two-dimensionality is irrelevant, unless every sub-array has only a few elements (say 4) or something like that... it's unlikely that a reasonably sized 2D array will have a 50% difference in size in C++ vs Java unless its subarrays are **really** small... as you already know too. ;) It's not the question of "is there overhead?" but rather "is there a noticeable % of overhead?" and I really doubt that's near 50% if his arrays are reasonably sized. – user541686 Aug 22 '12 at 13:21
  • @amit, is it possible to give some pointers on "Why did the designers chose this solution for arrays?" to start. – Arpssss Aug 22 '12 at 13:27
  • @Arpssss: In Java you cannot "hold" an object; each variable is a primitive or a reference to an object (and not the object itself). The same applies to each element in an array (`arr[x]`). One reason for it is that it makes pointers unnecessary (and indeed there are no pointers in Java). Pointers in C cause a lot of headaches for programmers, and as someone described it to me: "It allows the programmer to shoot both his legs." There is much more to it, but I am afraid this is not the right forum for it. – amit Aug 22 '12 at 13:32
  • Not a .NET expert, but don't they have custom value/primitive types which work in arrays, generics etc like in C/C++? Always found it annoying that java made say a 2 member object a lot less efficient than primitives in the array situations – Fire Lancer Aug 22 '12 at 13:33
  • @Arpssss: Simplicity of the language is one such reason. Using arrays that embed the type requires that the types (not the references, the real types) are assignable, which in turn means that you need to be able to generate assignment operators (otherwise arrays would be limited to the types that have a user-defined assignment), and provide the means for users to override the behavior for the cases where they need deep copies. The very next issue is that you can no longer have `null` stored in arrays, which in turn means that types must provide a default constructor to be usable in arrays... – David Rodríguez - dribeas Aug 22 '12 at 13:34
  • ... the complications continue and you end up changing Java from being a language with *reference* semantics to being a language with *value* semantics. The complexity grows and you end up with a beast that is no better than C++ and is probably worse. – David Rodríguez - dribeas Aug 22 '12 at 13:35

It sounds like you are an inexperienced programmer in a new job. The chances are that "they" have been in the business a long time, and know (or at least think they know) more about the domain and its programming requirements than you do.

My advice is to just do what they insist that you do. If they want the code in C or C++, just write it in C or C++. If you think you are going to have difficulties because you don't know much C / C++ ... warn them up front. If they still insist, they can wear the responsibility for any problems and delays their insistence causes. Just make sure that you do your best ... and try not to be a "squeaky wheel".


1) They have seen Array [Int-Max] [Int-Max] in Java will take nearly 1.5 times more memory than C and C++ takes some what reasonable memory footprint than Java.

That is feasible, though it depends on what is in the arrays.

  • Java can represent large arrays of most primitive types using close to optimal amounts of memory.

  • On the other hand, arrays of objects in Java can take considerably more space than in C / C++. In C++ for example, you would typically allocate a large array using new Foo[largeNumber] so that all of the Foo instances are part of the array instance. In Java, new Foo[largeNumber] is actually equivalent to new Foo*[largeNumber]; i.e. an array of pointers, where each pointer typically refers to a different object / heap node. It is easy to see how this can take a lot more space.
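The same effect shows up for an `int[rows][2]`, because every row is its own heap object. The back-of-the-envelope estimate below is only a sketch under stated assumptions: a 64-bit JVM with a 16-byte array header, 4-byte compressed references, and 8-byte object alignment. Real JVMs and settings can differ, and `ArrayFootprint` is a hypothetical helper.

```java
// Hypothetical estimator; the three constants are assumptions about a typical 64-bit JVM.
final class ArrayFootprint {
    static final long ARRAY_HEADER = 16; // assumed: 12-byte object header + 4-byte length
    static final long REF_SIZE = 4;      // assumed: compressed references
    static final long ALIGN = 8;         // assumed: 8-byte object alignment

    static long align(long bytes) {
        return (bytes + ALIGN - 1) / ALIGN * ALIGN;
    }

    // int[rows][cols]: one outer array of references plus one small array per row.
    static long nestedIntArray(long rows, long cols) {
        long outer = align(ARRAY_HEADER + rows * REF_SIZE);
        long inner = align(ARRAY_HEADER + cols * Integer.BYTES);
        return outer + rows * inner;
    }

    // One flat int[rows * cols].
    static long flatIntArray(long rows, long cols) {
        return align(ARRAY_HEADER + rows * cols * Integer.BYTES);
    }
}
```

Under these assumptions, `nestedIntArray(1_000_000, 2)` gives 28,000,016 bytes while `flatIntArray(1_000_000, 2)` gives 8,000,016 bytes, i.e. roughly 28 bytes per row against 8.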

2) C/C++ can handle arbitrarily large file where as Java can't.

There is a hard limit on the number of elements in a single 1-D Java array: 2^31 - 1 (Integer.MAX_VALUE). (You can work around this limit, but it will make your code more complicated.)
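One possible shape of such a workaround, shown only as a sketch, is a long-indexed structure that splits its storage into several ordinary arrays. The class `ChunkedIntArray` and the chunk-size parameter are hypothetical; this is not a standard library type.

```java
// Hypothetical sketch: a long-indexed int array built from fixed-size chunks.
final class ChunkedIntArray {
    private final int chunkSize;
    private final int[][] chunks;
    private final long length;

    ChunkedIntArray(long length, int chunkSize) {
        if (length < 0 || chunkSize <= 0) throw new IllegalArgumentException();
        this.length = length;
        this.chunkSize = chunkSize;
        // Round up so a partial last chunk is included.
        int chunkCount = Math.toIntExact((length + chunkSize - 1) / chunkSize);
        chunks = new int[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            long remaining = length - (long) i * chunkSize;
            chunks[i] = new int[(int) Math.min(chunkSize, remaining)];
        }
    }

    int get(long index) {
        return chunks[(int) (index / chunkSize)][(int) (index % chunkSize)];
    }

    void set(long index, int value) {
        chunks[(int) (index / chunkSize)][(int) (index % chunkSize)] = value;
    }

    long length() { return length; }
}
```

The extra division and modulo on every access are the added complexity mentioned above; choosing a power-of-two chunk size would let them become a shift and a mask.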

On the other hand if you are talking about simply reading and writing files, Java can handle individual files up to 2^63 bytes ... which is more than you could possibly ever want.

1) why C/C++ is always preferable on large database/data-structure over Java ? Because, C may be, but C++ is also an OOP. So, how it get advantage over Java ?

Because of the hard limit. The limit is part of the JLS and the JVM specification. It has nothing to do with OOP per se.

2) Should I stay on Java or their suggestion (switch to C++) will be helpful in future on large database/data-structure environment ? Any suggestion ?

Go with their suggestion. If you are dealing with in-memory datasets that are that large, then their concerns are valid. And even if their concerns are (hypothetically) a bit overblown it is not a good thing to be battling your superiors / seniors ...

Stephen C

1) They have seen Array [Int-Max] [Int-Max] in Java will take nearly 1.5 times more memory than C and C++ takes some what reasonable memory footprint than Java.

That depends on the situation. If you create a new int[1] or new int[1000], there is almost no difference between Java and C++. If you allocate data on the stack, the relative difference is large, as Java doesn't use the stack for such data.

I would first ensure this is not micro-tuning the application. It's worth remembering that one day of your time (assuming you get minimum wage) costs about as much as 2.5 GB of memory. So unless you are saving 2.5 GB per day by doing this, I suspect it's not worth chasing.

2) C/C++ can handle arbitrarily large file where as Java can't.

I have memory mapped an 8 TB file in a pure Java program, so I have no idea what this is about.

There is a limit: you cannot map more than 2 GB in a single mapping or have more than 2 billion elements in a single array. You can work around this by having more than one (e.g. up to 2 billion of those).
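The "more than one mapping" workaround could be sketched as follows, assuming plain `java.nio` and nothing else. `ChunkedMappedFile` and its parameters are hypothetical names chosen for illustration, and this is not necessarily how the 8 TB case above was implemented.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: address a large file through several mappings, because a
// single MappedByteBuffer cannot cover more than Integer.MAX_VALUE bytes.
final class ChunkedMappedFile {
    private final MappedByteBuffer[] regions;
    private final int regionSize;

    ChunkedMappedFile(Path file, long totalSize, int regionSize) throws IOException {
        if (totalSize <= 0 || regionSize <= 0) throw new IllegalArgumentException();
        this.regionSize = regionSize;
        int count = Math.toIntExact((totalSize + regionSize - 1) / regionSize);
        regions = new MappedByteBuffer[count];
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            if (ch.size() < totalSize) {
                // Grow the file by writing one byte at the last position.
                ch.write(ByteBuffer.wrap(new byte[1]), totalSize - 1);
            }
            for (int i = 0; i < count; i++) {
                long start = (long) i * regionSize;
                long size = Math.min(regionSize, totalSize - start);
                regions[i] = ch.map(FileChannel.MapMode.READ_WRITE, start, size);
            }
        } // mappings stay valid after the channel is closed
    }

    byte get(long position) {
        return regions[(int) (position / regionSize)].get((int) (position % regionSize));
    }

    void put(long position, byte value) {
        regions[(int) (position / regionSize)].put((int) (position % regionSize), value);
    }
}
```

This sketch only reads and writes single bytes; multi-byte values that straddle a region boundary would need extra handling.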

As we have to work on such large database/data-structure, C/C++ is always preferable.

I regularly load 200 - 800 GB of data with over 5 billion entries into a single Java process (sometimes more than one at a time on the same machine).

1) why C/C++ is always preferable on large database/data-structure over Java ?

There is more experience on how to do this in C/C++ than there is in Java, and their experience of how to do this is only in C/C++.

Because, C may be, but C++ is also an OOP. So, how it get advantage over Java ?

When using large datasets, it's more common to use a separate database in the Java world (embedded databases are relatively rare).

Java just calls the same system calls you can in C, so there is no real difference in terms of what you can do.

2) Should I stay on Java or their suggestion (switch to C++) will be helpful in future on large database/data-structure environment ? Any suggestion ?

At the end of the day, they pay you and sometimes technical arguments are not really what matters. ;)

Peter Lawrey