What is the overhead of creating Java objects from the lines of a CSV file?

Question

The code reads the lines of a CSV file like this:

Stream<String> strings = Files.lines(Paths.get(filePath));

Then it maps each line in the mapper:

String[] tokens = line.split(",");
return new UserModel(tokens[0], tokens[1], tokens[2], tokens[3]);

And finally it collects the result:

Set<UserModel> current = currentStream.collect(toSet());

The file size is ~500MB. I connected to the server using jconsole and saw that the heap size grew from 200MB to 1.8GB while processing.

I can't understand where this 3x memory usage comes from. I expected a spike of something like 500MB.

My first impression was that there is no throttling and the garbage collector simply doesn't have enough time for cleanup. But I tried using the Guava rate limiter to give the garbage collector time to do its job, and the result is the same.

And why read the entire file into memory? Process it a line at a time. — user207421, Jul 01 '19 at 00:39
@user207421 A method that returns `Stream` is not reading the entire file into memory. — VGR, Jul 01 '19 at 01:48
@user207421 according to google: In computer science, in the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored or transmitted and reconstructed later. — Alex Kamornikov, Jul 01 '19 at 06:27
If you are looking to reduce the memory usage, see [my Answer](https://stackoverflow.com/a/56471153/642706) to a similar Question where I show the use of [*Apache Commons CSV*](https://commons.apache.org/proper/commons-csv/) library using a `BufferedReader` to gradually read in the file rather than loading the entire file at once. You will save a half gig of memory by not reading in the entire file. However, regardless of how you read, a collection of objects will always take more octets than the plain text of a CSV file as described in the Answers. — Basil Bourque, Jul 03 '19 at 22:17

score 2 · Accepted Answer · answered Jul 01 '19 at 07:06

Tom Hawtin made good points - I just wanna expand on them and provide a bit more detail.

Java Strings take at least 40 bytes of memory (that's for an empty string) due to the Java object header overhead (see below) and an internal byte array. That means the minimal size of a non-empty string (1 or more characters) is 48 bytes.

Nowadays (since Java 9), the JVM uses Compact Strings, which means that strings containing only Latin-1 characters (which includes ASCII) occupy 1 byte per character; before that it was 2 bytes per char minimum. That means that if your file contains characters beyond that set, memory usage can grow significantly.

Streams also have more overhead compared to plain iteration with arrays/lists (see "Java 8 stream objects significant memory usage").
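To make the 40 and 48 byte figures concrete, here is a rough back-of-the-envelope sketch. It assumes a 64-bit HotSpot JVM with Compressed Oops, 8-byte object alignment and Compact Strings; the class and method names are made up for illustration, and a real measurement (e.g. with a heap dump) may differ on other JVMs or settings.

```java
// Rough estimate only. Assumptions: 64-bit HotSpot, Compressed Oops,
// 8-byte alignment, Compact Strings (Java 9+). Names are hypothetical.
final class StringFootprint {
    // 12-byte header + 4-byte array reference + 4-byte hash + small flag fields, padded to 24
    static final int STRING_OBJECT = 24;
    // 12-byte header + 4-byte array length
    static final int ARRAY_HEADER = 16;

    // Rounds n up to the next multiple of 8.
    static long align8(long n) {
        return (n + 7) / 8 * 8;
    }

    // Estimated heap bytes for a String of `chars` characters:
    // the String object itself plus its backing byte[].
    static long estimate(int chars, boolean latin1) {
        long payload = latin1 ? chars : 2L * chars;
        return STRING_OBJECT + align8(ARRAY_HEADER + payload);
    }

    public static void main(String[] args) {
        System.out.println(estimate(0, true));   // empty string: 40
        System.out.println(estimate(1, true));   // one Latin-1 character: 48
        System.out.println(estimate(10, false)); // ten UTF-16 characters: 64
    }
}
```

Under these assumptions, any Latin-1 string of 1 to 8 characters lands on the same 48 bytes, because the backing array is padded to the next 8-byte boundary.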

I guess your UserModel object adds at least 32 bytes of overhead on top of each line, because:

- the minimum size of a Java object is 16 bytes, of which the first 12 bytes are the JVM "overhead": the object's class reference (4 bytes when Compressed Oops are used) plus the 8-byte mark word (used for the identity hash code, biased locking, garbage collectors)
- the next 4 bytes are used by the reference to the first "token"
- the next 12 bytes are used by the 3 references to the second, third and fourth "token"
- the last 4 bytes are padding required by Java object alignment at 8-byte boundaries (on 64-bit architectures)
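The arithmetic in the list above can be sketched as follows. This is only an estimate of the shallow size of an object with four reference fields; the class name, the method and the header/reference sizes passed in are assumptions for illustration, not something read from the real UserModel class.

```java
// Shallow-size estimate for an object holding only reference fields.
// Header and reference sizes are assumptions about the JVM configuration.
final class UserModelFootprint {
    // Rounds n up to the next multiple of 8 (object alignment).
    static int align8(int n) {
        return (n + 7) / 8 * 8;
    }

    static int shallowSize(int headerBytes, int referenceBytes, int referenceFields) {
        return align8(headerBytes + referenceBytes * referenceFields);
    }

    public static void main(String[] args) {
        // Compressed Oops: 12-byte header, 4-byte references -> 28, padded to 32
        int compressed = shallowSize(12, 4, 4);
        // No Compressed Oops: 16-byte header, 8-byte references -> 48
        int uncompressed = shallowSize(16, 8, 4);
        System.out.println(compressed);                 // 32
        System.out.println(uncompressed);               // 48
        System.out.println(10_000_000L * compressed);   // 320000000 bytes for 10 million objects
    }
}
```

Note that this counts only the UserModel object itself; the four Strings it points to, and whatever the Set allocates per entry, come on top.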

That being said, it's not clear whether you even use all the data that you read from the file: you parse 4 tokens from a line, but maybe there are more? Moreover, you didn't mention how exactly the heap size "grew": was it the committed size or the used size of the heap? The used portion is what is actually occupied by objects (live ones plus garbage that has not been collected yet), while the committed portion is the memory the JVM has reserved for the heap at some point, whether or not objects currently occupy it; used < committed in most cases.

You'd have to take a heap snapshot to find out how much memory the resulting set of UserModel objects actually occupies; it would be interesting to compare that to the size of the file.
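If you want to check the used vs. committed numbers from inside the application rather than in jconsole, a minimal sketch using the standard `java.lang.management` API could look like this (the class name is made up; where and how often you call it is up to you):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Prints the same "used" and "committed" heap figures that jconsole shows.
final class HeapUsageProbe {
    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage usage = heap();
        long mb = 1024 * 1024;
        // "used" includes garbage that has not been collected yet
        System.out.printf("used=%d MB, committed=%d MB%n",
                usage.getUsed() / mb, usage.getCommitted() / mb);
    }
}
```

Calling this before and after the `collect` step (ideally after a GC) gives a better idea of how much the result set really holds on to.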
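One way to take such a snapshot programmatically is HotSpot's diagnostic MXBean. The sketch below assumes a HotSpot-based JVM (the `com.sun.management` API is not part of the Java SE standard); the class name, the temp directory and the file name are arbitrary choices for illustration. Tools like jmap or jcmd can produce the same kind of file from outside the process.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Writes a heap dump of the current JVM into a fresh temp directory.
// Assumes a HotSpot JVM; the target file must not exist yet.
final class HeapDumper {
    static Path dumpToTempDir() {
        try {
            Path file = Files.createTempDirectory("heapdump").resolve("snapshot.hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // true = dump only live (reachable) objects
            bean.dumpHeap(file.toString(), true);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap dump written to " + dumpToTempDir());
    }
}
```

The resulting .hprof file can be opened in a heap analyzer to see the retained size of the Set and its UserModel entries.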

Thanks for the explanations. I've taken a look at the heap dump and found that there are > 10_000_000 HashMap.Node instances (my file has 10_000_000 lines). They take up ~500MB. It looks like all of that comes from the Set. The other overhead comes from the UserModel objects: exactly 10_000_000 of them, taking exactly 480MB, so each object is 48 bytes in size. Char[] and String take 400MB and 280MB respectively. — Alex Kamornikov, Jul 01 '19 at 08:42
The memory overhead of a single Stream instance is irrelevant. — Holger, Jul 01 '19 at 13:14

score 1 · Answer 2 · answered Jun 30 '19 at 21:54

It may be that the String implementation is using UTF-16 whereas the file may be using UTF-8. That would double the size, assuming all US-ASCII characters. However, I believe JVMs tend to use a compact form for Strings nowadays.
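As a quick illustration of the doubling, this sketch compares the encoded size of the same text in UTF-8 and UTF-16. The sample line and the class name are made up, and it only measures encoded byte counts, not what a particular JVM actually stores internally.

```java
import java.nio.charset.StandardCharsets;

// Compares the encoded size of the same text in UTF-8 and UTF-16.
final class EncodingSizes {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16LE is used so that no byte-order mark is added to the count
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String line = "john,doe,42,NY"; // hypothetical CSV line, ASCII only
        System.out.println(utf8Bytes(line));  // 14
        System.out.println(utf16Bytes(line)); // 28
    }
}
```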

Another factor is that Java objects tend to be allocated on a nice round address. That means there's extra padding.

Then there's memory for the actual String object, in addition to the actual data in the backing char[] or byte[].

Then there's your UserModel object. Each object has a header, and references are usually 8 bytes (may be 4).

Lastly not all the heap will be allocated. GC runs more efficiently when a fair proportion of the memory isn't, at any particular moment, being used. Even C malloc will end up with much of the memory unused once a process is up and running.

score 0 · Answer 3 · answered Jun 30 '19 at 20:12

Your code reads the full file into memory. Then you start splitting each line into an array, then you create objects of your custom class for each line. So basically you have 3 different pieces of "memory usage" for each line in your file!

While enough memory is available, the JVM might simply not waste time running the garbage collector while turning your 500 megabytes into three different representations. Therefore you are likely to "triplicate" the number of bytes within your file, at least until the GC kicks in and throws away the no longer required file lines and split arrays.
