
I have a data set saved as a text file that basically contains vectors stored line by line. Each vector is 10k in dimensions, I have 250 such vectors, and each vector entry is a double. Here's an example:

Vector 1 -> 0.0 0.0 0.0 0.439367 0.0 .....10k such entries

Vector 2 -> 0.0 0.0 0.0 0.439367 0.0 0.0 0.0 0.0 .....10k such entries

...

...

Vector 250 -> 0.0 1.203973 0.0 0.0 0.0 .....10k such entries

Now if I do the math, this should take up 10k × 16 bytes × 250 of space (assuming each vector entry is a double taking up 16 bytes), which is ~40MB. However, I see that the file size is shown as only 9.8MB. Am I going wrong somewhere?

The thing is, I am using this data in my Java code. The space complexity of my algorithm is O(number of entries in the vector × number of entries). Even when I run my code by allocating like 4GB of memory, I still run out of heap space. What am I missing?

Thanks. Andy

  • Impossible to say without seeing the code. – Jonathon Faust Feb 15 '11 at 19:04
  • By the way, `double` is only 8 bytes long but this is largely irrelevant while we can't see the code. – biziclop Feb 15 '11 at 19:06
  • We are missing the source code. From your description alone it looks as if 128 MB should well be enough. Do you have problems with loading the data, or with later processing? – Roland Illig Feb 15 '11 at 19:06
  • @biziclop: Maybe Andy is using a `java.lang.Double` in a `Vector`. Then the 16 bytes would be correct for a HotSpot Java VM in 32-bit mode. – Roland Illig Feb 15 '11 at 19:08
  • I guess it is 16bytes because of 64bit? Or he is wrong. – Thomas Jungblut Feb 15 '11 at 19:09
  • Hello everybody, thank you for your answers. Unfortunately, I am not able to post the source code since I am under license agreement with another academic institution and it is not yet released publicly. What I am doing is actually reading in all these vectors from another file - I am basically reading a vector as a string and writing as a string... – Andy Feb 15 '11 at 19:12
  • Please post the java command line parameters, because 1GB should be enough regardless of whether you use Double or double (or...) - so I guess the problem may not be the code but the fact that the parameter to increase the heap does not work. – Ralph Feb 15 '11 at 19:13
  • @Andy In that case, don't send the source, but create a similar problem from scratch to isolate the problem so we can see what you're doing wrong. The stack trace is also relevant here; perhaps you've got an infinite loop? – OscarRyz Feb 15 '11 at 19:16
  • @Roland, if he's using `Vector` then the wrapper should be taken into account. `double[]` would be better perhaps. – OscarRyz Feb 15 '11 at 19:17
  • Where are you going wrong with the size estimation? Internally a double is 8 bytes, but when you write it as a string to a text file, this could be 3 bytes (for "0.0") or many many bytes (for an irrational number, e.g.) depending on the format, character set, etc. If most of your values are zero and you're using a compressed encoding like UTF-8, the text file will be smaller than storing it in memory as doubles. – Mark Peters Feb 15 '11 at 19:17
  • @Roland Illig In that case it could be even more, as an `Object` header structure is typically larger than 32 bits. In any case, using a `double []` is definitely more memory-efficient than using a `Vector`. – biziclop Feb 15 '11 at 19:17
  • @Mark, let's hope he's not reading the whole file before processing. But without the source this is impossible to answer. Voting to close as "can't read your mind" – OscarRyz Feb 15 '11 at 19:19
  • The command line I am using reads: ./jre1.6.0_24/bin/java -Xmx5g -jar regression.jar . – Andy Feb 15 '11 at 20:05
  • " but create a similar problem from the scratch to isolate the problem so we can see what your're doing wrong" - not sure what does this mean? – Andy Feb 15 '11 at 20:06
  • @Andy: Show us some simple code that reads a file into memory. The file format should be the same as you described in your posting. That code should work similarly to your actual code, so we can see how you try to solve the problem. – Roland Illig Feb 15 '11 at 22:10

5 Answers


After so many people guessing about the size, I have done 3 simple tests and used the Eclipse Memory Analyzer to determine the size. (Win7, 1.6.0_21, Java HotSpot(TM) 64-Bit Server VM)

  • double[][] = Size: 19.2 MB, Classes: 328, Objects: 2.7k
  • Double[][] = Size: 76.5 MB, Classes: 332, Objects: 2.5m
  • ArrayList<ArrayList<Double>> = Size: 79.6 MB, Classes: 330, Objects: 2.5m

256MB (java -Xmx256m Huge) was enough to run the tests.

So I guess the problem is not the size; it could be one of two things:

  • there is a bug in the algorithm
  • the JVM does not actually run with 4GB (see the quick check below)
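
To rule out the second point, here is a tiny check (not part of the original tests) that prints the heap the JVM actually got; `maxMemory()` roughly reflects the `-Xmx` value:

public class HeapCheck {
    public static void main(String[] args) {
        // Prints the maximum heap the JVM will try to use, in MB.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
    }
}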

If somebody is interested in the code:

import java.util.ArrayList;
import java.util.List;

public class Huge {

    private static final int NUMBER_OF_VECTORS = 250;
    private static final int VECTOR_SIZE = 10000;

    //Size: 19.2 MB Classes: 328 Objects: 2.7k
    public static void doubleArray() {

        double[][] structure = new double[NUMBER_OF_VECTORS][];

        for(int i = 0; i < NUMBER_OF_VECTORS; i++) {
            structure[i] = new double[VECTOR_SIZE];
        }
    }

    //Size: 76.5 MB Classes: 332 Objects: 2.5m
    public static void doubleWrapperArray() {

        Double[][] structure = new Double[NUMBER_OF_VECTORS][];

        for(int i = 0; i < NUMBER_OF_VECTORS; i++) {
            structure[i] = new Double[VECTOR_SIZE];
            for (int k = 0; k < VECTOR_SIZE; k++) {
                structure[i][k] = Double.valueOf(Math.random());
            }
        }
    }

    //Size: 79.6 MB Classes: 330 Objects: 2.5m
    public static void list() {

        List<List<Double>> structure = new ArrayList<List<Double>>(); 

        for(int i = 0; i < NUMBER_OF_VECTORS; i++) {
            List<Double> vector = new ArrayList<Double>();            
            for (int k = 0; k < VECTOR_SIZE; k++) {
                vector.add(Double.valueOf(Math.random()));
            }
            structure.add(vector);
        }
    }
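
    // Not shown in the original answer: a main method (added here as an assumption)
    // so the class can be launched with "java -Xmx256m Huge" as described above.
    public static void main(String[] args) {
        doubleArray();
        doubleWrapperArray();
        list();
    }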
}
Ralph

Without seeing the code, I can't say for certain, but it sounds like you're over-allocating either a) when you read the data from the file or b) somewhere in your algorithm. I would advise using a tool such as VisualVM to review your object allocation; it will be able to tell you how you're allocating and what mistakes you're making.

Steven Fines

Now if I do the math, this should take up 10k × 16 bytes × 250 of space (assuming each vector entry is a double taking up 16 bytes), which is ~40MB. However, I see that the file size is shown as only 9.8MB. Am I going wrong somewhere?

Where you're going wrong is the assumption that every double takes 16 bytes of space when saved as text. You seem to have lots of 0 values, which take only 4 bytes in string form (including separator).
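
A rough, back-of-the-envelope sketch of this point (it assumes single-space separators and a single-byte encoding; the 250 and 10k come from the question):

public class SizeEstimate {
    public static void main(String[] args) {
        int vectors = 250;
        int entriesPerVector = 10000;

        // In memory, a primitive double always takes 8 bytes.
        long inMemoryBytes = (long) vectors * entriesPerVector * 8;

        // As text, a zero entry is written as "0.0 " -> 4 characters, i.e. 4 bytes.
        long asTextBytes = (long) vectors * entriesPerVector * "0.0 ".length();

        System.out.println("In memory as double[][]: ~" + inMemoryBytes / 1000000 + " MB"); // ~20 MB
        System.out.println("As text, mostly zeros:   ~" + asTextBytes / 1000000 + " MB");   // ~10 MB
    }
}

The ~10MB text estimate lines up with the 9.8MB file you observed.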

Even when I run my code by allocating like 4GB of memory, I still run out of heap space. What am I missing?

That depends on your code. One reason might be that you're storing your data in an ArrayList<Double> or (worse) TreeSet<Double> - the Double wrapper objects will cause a memory overhead of easily 200% - and the Set/Map structures are much worse.
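
If all you need is the raw values, a primitive `double[][]` avoids the wrapper overhead entirely; 250 × 10k doubles is only about 20MB. A minimal sketch, assuming one whitespace-separated vector per line (the file name is just a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class VectorLoader {

    // Reads 'vectors' lines of 'vectorSize' whitespace-separated doubles each.
    public static double[][] load(String fileName, int vectors, int vectorSize) throws IOException {
        double[][] data = new double[vectors][vectorSize];
        BufferedReader reader = new BufferedReader(new FileReader(fileName));
        try {
            for (int i = 0; i < vectors; i++) {
                String[] tokens = reader.readLine().trim().split("\\s+");
                for (int k = 0; k < vectorSize; k++) {
                    data[i][k] = Double.parseDouble(tokens[k]);
                }
            }
        } finally {
            reader.close();
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        double[][] vectors = load("vectors.txt", 250, 10000);
        System.out.println("Loaded " + vectors.length + " vectors");
    }
}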

Michael Borgwardt

Hard to say without seeing the code and VM arguments. But note that the variables in your algorithm also consume memory, and that file size vs. memory usage depends on how you construct your in-memory objects; for example, even a simple object holding no double takes up space of its own.

Get a proper tool for benchmarking memory usage. Check out the TPTP Eclipse distribution.

Also, you might want to check out sparse matrices.
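
Since most entries in the sample vectors above are 0.0, a sparse layout can shrink each vector considerably. A minimal sketch of one possible representation (illustrative only, not a specific library):

import java.util.Arrays;

// Stores only the non-zero entries of one vector as parallel (index, value) arrays.
public class SparseVector {

    private final int[] indices;   // positions of the non-zero entries, ascending
    private final double[] values; // the corresponding non-zero values
    private final int dimension;   // logical length of the vector (e.g. 10000)

    public SparseVector(double[] dense) {
        this.dimension = dense.length;
        int nonZero = 0;
        for (double v : dense) {
            if (v != 0.0) {
                nonZero++;
            }
        }
        this.indices = new int[nonZero];
        this.values = new double[nonZero];
        int pos = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                indices[pos] = i;
                values[pos] = dense[i];
                pos++;
            }
        }
    }

    // Returns the entry at the given position, or 0.0 if it was not stored.
    public double get(int index) {
        int slot = Arrays.binarySearch(indices, index);
        return slot >= 0 ? values[slot] : 0.0;
    }

    public int dimension() {
        return dimension;
    }
}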

ThomasRS

If we can't see the code (which is fair enough), all I can say is: use the `-XX:+HeapDumpOnOutOfMemoryError` command line option when you start your application, then analyse the resulting heap dump with `jhat`.
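
Combined with the command line posted in the comments, that would look roughly like this (the dump file name contains the actual process id):

./jre1.6.0_24/bin/java -Xmx5g -XX:+HeapDumpOnOutOfMemoryError -jar regression.jar
jhat java_pid<pid>.hprof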

biziclop
  • I believe you mean "-XX:+HeapDumpOnOutOfMemoryError"? – eggsyntax Feb 15 '11 at 19:44
  • @eggsyntax Yes, of course, I'm an idiot. I wanted to avoid making a typo, so I copied it from the official documentation. – biziclop Feb 15 '11 at 19:54
  • Did you ever try to use `jhat` with a large heap? I did, and I had to set the heap size for `jhat` to at least five times the original file size. After waiting for `jhat` for more than one hour, just to parse a heap of approximately 3 GB, I decided that `jhat` is not on my list of preferred tools anymore. Now I use Eclipse with MAT. – Roland Illig Feb 15 '11 at 22:19
  • @Roland Illig Or you could just set your max heap lower for the test. :) – biziclop Feb 15 '11 at 22:39