
My main program looks as below (pseudo code):

public static void main(String[] args) {

    // produce lots of int[] data which is stored inside a list of hashmaps
    List<HashMap<Integer, int[]>> dataArray1 = new ArrayList<HashMap<Integer, int[]>>();
    ...

    // create a new list of data, similar to dataArray1
    // now we will write into dataArray2 and read from dataArray1
    List<HashMap<Integer, int[]>> dataArray2 = new ArrayList<HashMap<Integer, int[]>>();
    while (true) {
        if (exitCondition) break;
        ...
        for index1, index2 in a set of indices {
            int[] a1 = dataArray1.get(index1).get(key1);
            int[] a2 = dataArray1.get(index2).get(key2);
            int[] b = intersect a1 and a2;
            int i = generateIndex(index1, index2);
            int key = generateKey(key1, key2);
            dataArray2.get(i).put(key, b);
        }
    }

    // now we can remove dataArray1
    dataArray1 = null;

    // create a new list of data, similar to dataArray2
    // now we will write into dataArray3 and read from dataArray2
    List<HashMap<Integer, int[]>> dataArray3 = new ArrayList<HashMap<Integer, int[]>>();
    while (true) {
        if (exitCondition) break;
        ...
        for index1, index2 in a set of indices {
            int[] a1 = dataArray2.get(index1).get(key1);
            int[] a2 = dataArray2.get(index2).get(key2);
            int[] b = intersect a1 and a2;
            int i = generateIndex(index1, index2);
            int key = generateKey(key1, key2);
            dataArray3.get(i).put(key, b);
        }
    }

    // now we can remove dataArray2
    dataArray2 = null;

    ...
    // and so on 20 times

}

My problem is that at some point dataArray_k for some k > 1 becomes heavy (say 20 GB), so storing it in memory is impossible. I can change int[] to BitSet, but that does not help; it uses even more memory.

The solution would be to use either a database or the file system. Which would you recommend? I need performance (execution time); memory usage does not matter. If your experience says database, then please recommend the fastest interface for dealing with a specific (which?) database, be it Berkeley DB, PostgreSQL or whatever. If it says file system, then please recommend the fastest interface (file libraries).

As for read/write statistics: in each while loop of my code I do 3 times more reads than writes; for example, for one level k I read from dataArray_k 12000 times and write into dataArray_(k+1) 4000 times.

I can store each HashMap from List<HashMap<Integer, int[]>> dataArray1 in a separate file.
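
For example, a map-per-file round trip could look roughly like this (a sketch only; the class and the file handling are illustrative, not part of my actual program):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    public class MapFileStore {

        // Write one HashMap<Integer, int[]> to its own binary file.
        static void save(HashMap<Integer, int[]> map, File file) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(file)))) {
                out.writeInt(map.size());
                for (Map.Entry<Integer, int[]> e : map.entrySet()) {
                    out.writeInt(e.getKey());
                    out.writeInt(e.getValue().length);
                    for (int v : e.getValue()) {
                        out.writeInt(v);
                    }
                }
            }
        }

        // Read the map back when that level is needed again.
        static HashMap<Integer, int[]> load(File file) throws IOException {
            HashMap<Integer, int[]> map = new HashMap<Integer, int[]>();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(file)))) {
                int entries = in.readInt();
                for (int i = 0; i < entries; i++) {
                    int key = in.readInt();
                    int[] values = new int[in.readInt()];
                    for (int j = 0; j < values.length; j++) {
                        values[j] = in.readInt();
                    }
                    map.put(key, values);
                }
            }
            return map;
        }
    }

With a layout like that, each level could be streamed to disk once it is complete and reloaded only for the reads of the next level.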

Sophie Sperner
  • it would make much more sense to tell us what you are trying to do. 20 GB of data is A LOT – Eugene Apr 17 '13 at 09:24
  • Simple answer: I have a million arrays that represent biological cell sensor data, which I need to intersect in some way, so in each `dataArray_k` I store all the possible intersections of arrays taken from the previous `dataArray_(k-1)`. My program is optimized, so it is no surprise that it takes so much memory. – Sophie Sperner Apr 17 '13 at 09:25
  • It depends on your bio algorithm: if calculations reference only nearby values, the disk approach will work. If the calculation always accesses all values, it would be better to buy more memory, together with optimizing, e.g. a short array. – AlexWien Apr 17 '13 at 09:33
  • Can your bio data somehow be considered 2-dimensional? Or 3-dimensional? – AlexWien Apr 17 '13 at 09:34
  • I smell disk-based storage with a huge cache makes sense. So what are the access paths to the data? How performance-critical is this? Are particular chunks of data requested more often than others? Do you need transaction management? Is the pseudo code shown the only operation you do in that app with the data? Before firing up a db and designing a schema etc. I'd think about a file-based approach with an intelligent data structure and some kind of memory-based buffering/caching/readahead mechanism. – TheBlastOne Apr 17 '13 at 09:42
  • You might also want to look for some clever way to compress the in-memory array data, like replacing neighboring elements that share the same value by one special element and a count. Then the data might fit into memory without losing info, and you can stick to the current approach (a run-length sketch follows these comments). – TheBlastOne Apr 17 '13 at 09:43
  • I too suggest trying everything to avoid disk access, because then the fun is over: speed may drop by a factor of 1000. – AlexWien Apr 17 '13 at 10:02
  • On one hand, you say that "storing it in memory is impossible", yet "memory does not matter". What prevents you from jamming 64GB more into the machine and running your perfectly fine algorithm? No amount of storage caching will make up for memory access times. – 0xCAFEBABE Apr 17 '13 at 10:17
  • Thank you people, I need to think a lot. But your point that file access will be too slow is taken and will be considered. – Sophie Sperner Apr 17 '13 at 11:33
  • As I have mentioned before, you can buy 32 GB for less than $300; you can also buy a fast SSD for less than $100 to speed up disk access. – Peter Lawrey Apr 17 '13 at 11:44
  • Better to use Amazon or something similar than to buy. – Sophie Sperner Apr 17 '13 at 11:50
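
The run-length compression suggested in the comments could look roughly like this (a sketch; it only helps when long runs of equal values actually occur, and the class name is illustrative):

    import java.util.ArrayList;
    import java.util.List;

    public class RunLength {

        // Encode an int[] as (value, count) pairs.
        static int[] encode(int[] data) {
            List<Integer> out = new ArrayList<Integer>();
            int i = 0;
            while (i < data.length) {
                int value = data[i];
                int count = 1;
                while (i + count < data.length && data[i + count] == value) {
                    count++;
                }
                out.add(value);
                out.add(count);
                i += count;
            }
            int[] encoded = new int[out.size()];
            for (int j = 0; j < encoded.length; j++) {
                encoded[j] = out.get(j);
            }
            return encoded;
        }

        // Expand (value, count) pairs back into the original array.
        static int[] decode(int[] encoded) {
            int length = 0;
            for (int j = 1; j < encoded.length; j += 2) {
                length += encoded[j];
            }
            int[] data = new int[length];
            int pos = 0;
            for (int j = 0; j < encoded.length; j += 2) {
                int value = encoded[j];
                int count = encoded[j + 1];
                for (int c = 0; c < count; c++) {
                    data[pos++] = value;
                }
            }
            return data;
        }
    }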

2 Answers


Yesterday I evaluated the read performance of the different Java IO/NIO techniques. It turned out that on a PC the memory map provided by java.nio, together with IntBuffer, had the best read performance. Details with code here: Fastest way to read huge number of int from binary file
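
A minimal sketch of that memory-mapped read, assuming a binary file of raw ints named data.bin (illustrative code, not the benchmark from the linked post):

    import java.io.RandomAccessFile;
    import java.nio.IntBuffer;
    import java.nio.channels.FileChannel;

    public class MappedIntReader {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
                 FileChannel channel = raf.getChannel()) {
                // Map the whole file and view it as ints.
                // Note: a single mapping is limited to 2 GB, so a 20 GB file
                // would have to be split across several mappings.
                IntBuffer ints = channel
                        .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                        .asIntBuffer();
                long sum = 0;
                while (ints.hasRemaining()) {
                    sum += ints.get(); // sequential reads are served from the OS page cache
                }
                System.out.println("sum = " + sum);
            }
        }
    }

Writing the arrays in that raw binary layout in the first place is what makes the IntBuffer view possible.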

Of course it also turned out that algorithmic changes have much more potential to improve speed. E.g. in your case, consider multi-dimensional search structures like a quadtree or an R*-tree to reduce disk access for bio data that are somehow closely related.

Update: as I see your code now, it seems that you always iterate over all values (that is not really clear). First try to use a short array, which needs half the space.
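
A sketch of the short-array idea, assuming every value really fits into 16 bits (the range check and class name are illustrative):

    public class ShortPacker {

        // Convert an int[] to a short[] to halve the memory footprint.
        // Only valid when every value fits into the signed 16-bit range.
        static short[] toShorts(int[] values) {
            short[] result = new short[values.length];
            for (int i = 0; i < values.length; i++) {
                if (values[i] < Short.MIN_VALUE || values[i] > Short.MAX_VALUE) {
                    throw new IllegalArgumentException(
                            "value does not fit into a short: " + values[i]);
                }
                result[i] = (short) values[i];
            }
            return result;
        }
    }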

AlexWien
  • +1 I have written a library to make it easier to work with memory mapped files, including sharing them between processes. It can handle data in the TB range without using much heap (<< 1 MB): https://github.com/peter-lawrey/Java-Chronicle BTW, using `long`s might be faster than using `int`s. – Peter Lawrey Apr 17 '13 at 11:41

Honestly, reading that much data with Java is probably going to be a nightmare. I have only worked with up to 5 GB of text files and it was REALLY slow and difficult. You could use something closer to the OS (sed, grep, find, etc.). If Java is A MUST, then the NIO packages will, I suppose, be faster than plain File I/O.

Look here

Eugene