3

We have a class called Row which represents a row in a result set. We need to write a List<Row> to file so that it can be retrieved much later.

One way to do this is by using Java's serialization support.

I imagine the best way is to implement serialization inside the Row class. Then we would use the serialize method of List<Row>, in order to write to file.

I wanted to know, how efficient would this be? Would it take up far more space than simply writing a CSV file adapter that converts our List<Row> object to a CSV file?

ktm5124
  • That's an interesting question, there is a `long serialVersionUID` (so every row would be padded by a minimum of 64-bits). You could use `Externalizable` (and you could **pack** the data). If you want a more complete answer, I suggest posting some code (and a benchmark). – Elliott Frisch Jun 25 '16 at 00:09
  • You may be interested in [this question](http://stackoverflow.com/q/515631/5743988). (Possible dupe?) It deals more with speed (which IMO is more important), but they mention file size too. – 4castle Jun 25 '16 at 00:12
  • @ElliotFrisch The `serialVersionUID` isn't transmitted with every object. It is transmitted once per `newClassDesc`, which is transmitted once per class per stream. – user207421 Jun 25 '16 at 00:20
  • Am I right to assume that if I implement serialization in the `Row` class, then the `List<Row>.serialize()` method will do the trick? – ktm5124 Jun 25 '16 at 00:26
  • @ktm5124 no, this would never compile. By the way, serialization is more a way of communicating objects across different machines than a storage method... If you need to store this for a long time, there are probably much better ways to do it (other formats, a database, etc.) – Dici Jun 25 '16 at 00:31
  • @ktm5124 No, because there is no such method. What you're looking for is `ObjectOutputStream.writeObject(list)`. – user207421 Jun 25 '16 at 00:53
  • @EJP And also because `List<Row>.serialize()` is not even syntactically correct in Java – Dici Jun 25 '16 at 00:54
  • @Dici That's an accepted shorthand to *name* a method in documentation. It doesn't have to compile. `Socket.close()` is another example. – user207421 Jun 25 '16 at 01:03
  • @EJP I do know that, but I've never seen a generic type used like this and I don't believe this is correct unless you have a concrete example from the doc to show me – Dici Jun 25 '16 at 10:06
  • By the way, `Socket.close()` is syntactically correct, `List<Row>.serialize()` isn't. – Dici Jun 25 '16 at 10:28
  • @Dici I don't need a concrete example to show you. It's a simple extrapolation. How else could you possibly write it? `Socket.close()` won't compile, so whether it is syntactically correct is irrelevant. – user207421 Jun 26 '16 at 02:06
  • @EJP I just wanted to stress how extravagant the suggestion of using `List<Row>.serialize()` was, since it does not even *look like* Java code. I don't think this was used as a shorthand at all in this case. If it was some conventional or at least syntactically correct shorthand, I would be less inclined to think it was a purely random question like `Will Magic.debug() fix my program?`. Anyway... let's forget that ever happened. **PS:** I would write it as `List. serialize()` or just `List.serialize()` because generics aren't part of the actual signature in most cases – Dici Jun 26 '16 at 02:29
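
To make the `ObjectOutputStream.writeObject(list)` suggestion from the comments concrete, here is a minimal sketch. The `Row` class, its two fields, and the `RowListStore` helper are hypothetical stand-ins, since the question does not show the real `Row`; the only requirement is that `Row` (and everything it references) implements `Serializable`.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowListStore {

    // Hypothetical stand-in for the question's Row class.
    static final class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;

        Row(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Copy into an ArrayList (which is Serializable) and write the whole list in one call.
    static void save(List<Row> rows, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(rows));
        }
    }

    // Read the list back; the cast is unchecked because generics are erased at runtime.
    @SuppressWarnings("unchecked")
    static List<Row> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<Row>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("rows", ".ser");
        save(Arrays.asList(new Row(1, "alpha"), new Row(2, "beta")), file);
        List<Row> restored = load(file);
        System.out.println(restored.size() + " rows restored, first name: " + restored.get(0).name);
        Files.delete(file);
    }
}
```

There is no serialization method on `List` itself; the stream does the work, and the list only has to be a serializable implementation such as `ArrayList`.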

2 Answers

4

Would it take up far more space than simply writing a CSV file adapter that converts our List<Row> object to a CSV file?

It depends on the type of Row, and also on the size and other aspects of the data you are saving (see footnote 1).

On the one hand, the Java serialization protocol includes metadata for every class mentioned in the serialization. This takes significant space.

On the other hand:

  • Java serialization only includes the metadata once per serialization. So if you serialize lots of instances of the same class, the metadata cost becomes insignificant.
  • In a CSV file, all non-textual data has to be converted to text. In some cases (e.g. large numbers, floating point numbers, booleans) the textual representation will be larger than the binary representation that is used in Java serialization.

1 - For example, an array of random numbers versus an array of zeros and ones. Java serialization will be better in the first case, and CSV will be better in the second.
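
The "metadata once per serialization" point can be checked with a small in-memory measurement. This is only a sketch: the `Row` class with an `int` and a `double` field is a hypothetical stand-in for the real one, and the exact byte counts depend on the class and field names, so no specific numbers are claimed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class MetadataCostDemo {

    // Hypothetical Row with two primitive fields.
    static final class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final double value;

        Row(int id, double value) {
            this.id = id;
            this.value = value;
        }
    }

    // Serialize an array of `count` rows into memory and return the stream length in bytes.
    static int serializedSize(int count) throws IOException {
        Row[] rows = new Row[count];
        for (int i = 0; i < count; i++) {
            rows[i] = new Row(i, i * 0.5);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(rows);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        int one = serializedSize(1);
        int thousand = serializedSize(1000);
        // The class descriptor is written once, so the marginal cost per extra row
        // is much smaller than the cost of the first row.
        System.out.println("1 row: " + one + " bytes");
        System.out.println("1000 rows: " + thousand + " bytes");
        System.out.println("marginal bytes per extra row: " + (thousand - one) / 999.0);
    }
}
```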


But I think that you are probably focusing on the wrong thing here:

  • Unless the files you are generating are enormous, the size probably doesn't matter. Disk space is cheap.
  • The files are likely to be compressible in either case, with the less dense form probably being more compressible.
  • What matters more is whether the representation is suitable for purpose; e.g.
    • Do you want it to be human readable?
    • Do you want it to be readable by non-Java programs, including shell scripts?
    • Do you need to worry about changes to your Java code causing mismatches between the current class version and the version that was serialized?
    • Do you want to be able to stream the data? (When writing or reading.)
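
The compressibility point above can be tried out by pushing both representations through `GZIPOutputStream`. The sketch below uses the all-ones data from the other answer's test and serializes a bare `int[]`; the class and method names are made up for illustration, and how well real `Row` data compresses will depend on its contents.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

final class CompressionDemo {

    // Text form: every value followed by a comma.
    static byte[] asCsv(int[] values) {
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            sb.append(v).append(',');
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Binary form: the array written with Java serialization.
    static byte[] asSerialized(int[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(values);
        }
        return bytes.toByteArray();
    }

    // Size of the data after gzip compression.
    static int gzippedSize(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(raw);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        int[] ones = new int[1000];
        Arrays.fill(ones, 1);
        byte[] csv = asCsv(ones);
        byte[] ser = asSerialized(ones);
        System.out.println("csv: " + csv.length + " bytes raw, " + gzippedSize(csv) + " gzipped");
        System.out.println("ser: " + ser.length + " bytes raw, " + gzippedSize(ser) + " gzipped");
    }
}
```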
Stephen C
  • It may also be worth noting that CSV will have problems if the content itself contains the delimiter (a comma in this case). That wouldn't be a problem for the serialization approach. – Tom Jun 25 '16 at 01:04
  • @Tom actually no. Check this answer http://stackoverflow.com/a/769675 and also take a look at [Csv Schema](http://digital-preservation.github.io/csv-schema/csv-schema-1.0.html#basics) – Onur Jun 25 '16 at 01:07
  • @Tom - Well-formed CSV deals with that case by quoting. However, there is another problem: there are many variants of CSV, and you need to know which one you are using to read it correctly. – Stephen C Jun 25 '16 at 01:20
  • @Onur Then I wonder why I had to deal with so many 'invalid' CSV files in the past :(. Btw, the first link only says "should" and I couldn't find anything about this matter in the second link. – Tom Jun 25 '16 at 01:20
  • @Tom - Because the files were written by applications coded by people who didn't know what they were doing :-) Hint: don't code CSV readers / writers by hand. Use a library, if you can. – Stephen C Jun 25 '16 at 01:21
  • Too bad this happens so often :D. *"don't code CSV readers / writers by hand. Use a library, if you can."* I now wonder why you think that _I_ have written invalid CSV files ... – Tom Jun 25 '16 at 01:23
  • Hint: @Tom - you are not the only person reading these comments. – Stephen C Jun 25 '16 at 01:46
  • But I'm the only person you addressed using `@Tom` ;P. – Tom Jun 25 '16 at 01:50
  • The @Tom doesn't mean that I am >>only<< talking to you. That is not how it is used. – Stephen C Jun 25 '16 at 02:14
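
The quoting rule mentioned in the comments above can be illustrated as follows. This is a sketch of the common RFC 4180 style convention only (wrap a field in double quotes if it contains a comma, a quote, or a line break, and double any embedded quotes); other CSV dialects differ, the `CsvQuoting` name is made up, and as the comments say, a library is usually the better choice in real code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CsvQuoting {

    // Quote a single field if it contains a delimiter, a quote, or a line break.
    static String quote(String field) {
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are escaped by doubling them.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join already-stringified fields into one CSV record (without the line terminator).
    static String toRecord(List<String> fields) {
        List<String> quoted = new ArrayList<>();
        for (String field : fields) {
            quoted.add(quote(field));
        }
        return String.join(",", quoted);
    }

    public static void main(String[] args) {
        System.out.println(toRecord(Arrays.asList("plain", "a,b", "say \"hi\"")));
    }
}
```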
3

Java serialization will be less space efficient in certain cases than simply writing to a CSV file because it stores extra metadata to identify class-types.

I verified such a scenario with two simple test programs. The first one writes an array of ints to a .csv file.

import java.io.FileNotFoundException;
import java.io.PrintWriter;

public class CSVDemo {
  public static void main(String[] args) {
    // try-with-resources closes the writer even if writing fails
    try (PrintWriter pw = new PrintWriter("dummy.csv")) {
      StringBuilder sb = new StringBuilder();
      // 1000 values of "1", each followed by a comma: 2000 characters in total
      for (int i = 0; i < 1000; i++) {
        sb.append(1);
        sb.append(",");
      }
      pw.write(sb.toString());
    } catch (FileNotFoundException e) {
      e.printStackTrace();
      return;
    }
    System.out.println("Data is saved in dummy.csv");
  }
}

The second one serializes an object containing an array of ints to a .ser file.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializeDemo {
   public static void main(String[] args) {
      DummyData dummy = new DummyData();

      // try-with-resources closes both streams, even if writeObject fails
      try (FileOutputStream fileOut = new FileOutputStream("dummy.ser");
           ObjectOutputStream out = new ObjectOutputStream(fileOut)) {
         out.writeObject(dummy);
      } catch (IOException e) {
         e.printStackTrace();
         return;
      }
      System.out.println("Serialized data is saved in dummy.ser");
   }

   // Wrapper object whose only field is an array of 1000 ints, all set to 1
   public static class DummyData implements Serializable {
      int[] data = new int[1000];

      public DummyData() {
         for (int i = 0; i < 1000; i++) {
            data[i] = 1;
         }
      }
   }
}

The .ser file took 4079 bytes. The .csv file took 2000 bytes. Granted, this is a slight simplification of your use case (I'm equating an int to your Row type), but the general trend should be the same.

Trying with larger arrays yields the same result. Using 100000 ints results in ~400KB for .ser and 200KB for .csv.

However, as a comment below points out, this test favors CSV because every value is a single digit. If the ints are chosen at random instead, the .csv file ends up larger than the .ser file.
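
A rough way to see both outcomes side by side is to measure the two formats in memory. This is a sketch rather than the exact test above: it serializes a bare `int[]` instead of the `DummyData` wrapper, the `SizeComparison` name and the seed are arbitrary, and the random values come from `Random.nextInt()` over the full `int` range.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.Random;

final class SizeComparison {

    // Bytes needed for "value," repeated for every element (ASCII, one byte per character).
    static int csvSize(int[] values) {
        int size = 0;
        for (int v : values) {
            size += Integer.toString(v).length() + 1; // digits (and sign) plus the comma
        }
        return size;
    }

    // Bytes produced by Java serialization of the array.
    static int serializedSize(int[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(values);
        }
        return bytes.size();
    }

    static int[] ones(int n) {
        int[] values = new int[n];
        Arrays.fill(values, 1);
        return values;
    }

    static int[] randoms(int n, long seed) {
        Random random = new Random(seed);
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextInt();
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        int[] constant = ones(1000);
        int[] random = randoms(1000, 42L);
        System.out.println("all ones: csv=" + csvSize(constant) + " ser=" + serializedSize(constant));
        System.out.println("random:   csv=" + csvSize(random) + " ser=" + serializedSize(random));
    }
}
```

The serialized size is the same for both arrays, since every `int` takes four bytes regardless of its value; only the CSV size changes with the data.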

adao7000
  • Actually, you have a slight error in your CSV file. CSV uses "," to separate columns; you need to use "\r\n" to separate rows. So `sb.append(",");` should be `sb.append("\r\n");`, and the result will be a 3000-byte file instead of a 2000-byte one. – Onur Jun 25 '16 at 00:48
  • Note that the properties of this particular example make serialization worse. If instead you serialized a large (enough) array of `int` values chosen using `Random.nextInt()`, the CSV form would be larger. – Stephen C Jun 25 '16 at 00:54
  • To prove @StephenC 's comment: [Test with random values](https://ideone.com/kz1BTR) – Onur Jun 25 '16 at 01:24
  • @StephenC That's interesting, I had no idea. Why is that? I've modified my answer to reflect your comment. Thanks! – adao7000 Jun 25 '16 at 03:55
  • @adao7000 I'm assuming it's because in the serialized format an `int` always takes 32 bits, whereas stored as characters a single-digit integer takes only 8 bits. As soon as you start storing numbers strictly bigger than 999, the CSV format (four or more digits plus a delimiter) becomes more wasteful than binary. Considering there are many more numbers between 1000 and `2^32 - 1` than between 0 and 999, it is indeed far better to use a binary format to store an arbitrary integer array – Dici Jun 26 '16 at 02:40
  • By the way, you can verify what I said pretty easily: in the CSV file you store 1000 commas and 1000 1's, which is 2000 characters => 2000 bytes. In the serialized form, you store 1000 32-bit integers => 4000 bytes. The 79 remaining bytes must come, I presume, from some metadata about the class that has been serialized (probably the class name string `"SerializeDemo$DummyData"` and maybe some other things) – Dici Jun 26 '16 at 02:45