24

I have a big file; it's expected to be around 12 GB. I want to load it all into memory on a beefy 64-bit machine with 16 GB of RAM, but I think Java does not support byte arrays that big:

File f = new File(file);
long size = f.length();
byte data[] = new byte[size]; // <- does not compile, not even on a 64-bit JVM

Is it possible with Java?

The compile error from the Eclipse compiler is:

Type mismatch: cannot convert from long to int

javac gives:

possible loss of precision
found   : long
required: int
         byte data[] = new byte[size];
Omry Yadan
  • Just curious: Why do you need to keep that much data in memory at the same time? Wouldn't it be possible to split that into chunks? – bruno conde May 18 '09 at 15:39
  • +1 to bruno's comment. The only way that having the entire file in memory will be a benefit is if you need to make random accesses to different points of the file, and in that case you'd almost certainly be better off parsing it into a more computable representation – kdgregory May 18 '09 at 15:58
  • I am going to try to use a prefix tree (trie) to keep the data; this may shrink it enough to fit into 2 GB of memory. – Omry Yadan May 19 '09 at 03:08
  • possible duplicate of [converting 'int' to 'long' or accessing too long array with 'long'](http://stackoverflow.com/questions/3943585/converting-int-to-long-or-accessing-too-long-array-with-long) – Leonel Feb 08 '13 at 15:54
  • Wow. Very frustrating. Java must solve this in the next 5 years. – Chameleon Feb 05 '17 at 11:14

11 Answers

22

Java array indices are of type int (4 bytes or 32 bits), so I'm afraid you're limited to 2^31 − 1, or 2147483647, slots in your array. I'd read the data into another data structure, like a 2D array.
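
For illustration, a minimal sketch of that chunking idea (the chunk size and count here are my own choices, sized to cover the 12 GB in the question):

// Hypothetical sketch: sixteen 768 MB chunks cover 12 GB, and each chunk
// stays well under the 2^31 - 1 element limit.
final int CHUNK = 768 * 1024 * 1024;
byte[][] data = new byte[16][];
for (int i = 0; i < data.length; i++) {
    data[i] = new byte[CHUNK];
}
// the element at long index n lives at data[(int) (n / CHUNK)][(int) (n % CHUNK)]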

Bill the Lizard
  • @OmryYadan, The [real limit will actually be less](http://stackoverflow.com/questions/3038392/do-java-arrays-have-a-maximum-size/8381338#comment45805541_3039805) than 2147483647. – Pacerier Nov 26 '15 at 04:15
  • You mean Integer.MAX_VALUE - 8? https://github.com/omry/banana/blob/1621638d6eb4db773045af66eac66be0fffa91fa/banana/src/net/yadan/banana/memory/block/BigBlockAllocator.java#L25 – Omry Yadan Nov 26 '15 at 07:16
15
package com.deans.rtl.util;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * 
 * @author william.deans@gmail.com
 *
 * Written to work with byte arrays requiring address space larger than 32 bits. 
 * 
 */

public class ByteArray64 {

    private static final long CHUNK_SIZE = 1024L * 1024 * 1024; // 1 GiB

    long size;
    byte [][] data;

    public ByteArray64( long size ) {
        this.size = size;
        if( size == 0 ) {
            data = null;
        } else {
            int chunks = (int)(size/CHUNK_SIZE);
            int remainder = (int)(size - ((long)chunks)*CHUNK_SIZE);
            data = new byte[chunks+(remainder==0?0:1)][];
            for( int idx=chunks; --idx>=0; ) {
                data[idx] = new byte[(int)CHUNK_SIZE];
            }
            if( remainder != 0 ) {
                data[chunks] = new byte[remainder];
            }
        }
    }
    public byte get( long index ) {
        if( index<0 || index>=size ) {
            throw new IndexOutOfBoundsException("Error attempting to access data element "+index+".  Array is "+size+" elements long.");
        }
        int chunk = (int)(index/CHUNK_SIZE);
        int offset = (int)(index - (((long)chunk)*CHUNK_SIZE));
        return data[chunk][offset];
    }
    public void set( long index, byte b ) {
        if( index<0 || index>=size ) {
            throw new IndexOutOfBoundsException("Error attempting to access data element "+index+".  Array is "+size+" elements long.");
        }
        int chunk = (int)(index/CHUNK_SIZE);
        int offset = (int)(index - (((long)chunk)*CHUNK_SIZE));
        data[chunk][offset] = b;
    }
    /**
     * Simulates a single read which fills the entire array via several smaller reads.
     * 
     * @param fileInputStream
     * @throws IOException
     */
    public void read( FileInputStream fileInputStream ) throws IOException {
        if( size == 0 ) {
            return;
        }
        for( int idx=0; idx<data.length; idx++ ) {
            // InputStream.read() may legitimately return fewer bytes than requested,
            // so keep reading until each chunk is completely filled.
            int off = 0;
            while( off < data[idx].length ) {
                int n = fileInputStream.read( data[idx], off, data[idx].length-off );
                if( n < 0 ) {
                    throw new IOException("unexpected end of file");
                }
                off += n;
            }
        }
    }
    public long size() {
        return size;
    }
}
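
For context, a hypothetical usage sketch along the lines of the original question (the `file` path variable is assumed from the question, not defined here):

File f = new File(file);
ByteArray64 data = new ByteArray64(f.length());
try (FileInputStream in = new FileInputStream(f)) {
    data.read(in);
}
byte first = data.get(0L);
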
William Deans
7

If necessary, you can load the data into an array of arrays, which gives you a maximum of Integer.MAX_VALUE squared bytes, more than even the beefiest machine could hold in memory.

Yes - that Jake.
  • That would be my next step. Since I intend to do a binary search on the data, it will uglify the code, but I'm afraid there is no choice. – Omry Yadan May 18 '09 at 15:34
  • You could make a class that manages an array of arrays but provides an abstraction similar to a regular array, e.g., with get and set methods that take a long index. – Jay Conrod May 18 '09 at 15:36
4

You might consider using FileChannel and MappedByteBuffer to memory-map the file:

FileChannel fCh = new RandomAccessFile(file, "rw").getChannel();
long size = fCh.size();
MappedByteBuffer map = fCh.map(FileChannel.MapMode.READ_WRITE, 0, size);

Edit:

Ok, I'm an idiot: it looks like ByteBuffer only takes a 32-bit index as well, which is odd since the size parameter to FileChannel.map is a long... But if you decide to break up the file into multiple 2 GB chunks for loading, I'd still recommend memory-mapped I/O, as there can be pretty large performance benefits. You're basically moving all I/O responsibility to the OS kernel.
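
For what it's worth, here's a rough sketch of that chunked-mapping idea (the class and method names are my own illustration, not part of any library):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: map a file larger than 2 GB as a series of
// MappedByteBuffer chunks, since a single mapping is capped at
// Integer.MAX_VALUE bytes.
public class ChunkedMapper {
    private static final long CHUNK = Integer.MAX_VALUE; // max bytes per mapping

    public static MappedByteBuffer[] mapAll(RandomAccessFile raf) throws IOException {
        FileChannel ch = raf.getChannel();
        long size = ch.size();
        int chunks = (int) ((size + CHUNK - 1) / CHUNK);
        MappedByteBuffer[] maps = new MappedByteBuffer[chunks];
        for (int i = 0; i < chunks; i++) {
            long pos = i * CHUNK;
            long len = Math.min(CHUNK, size - pos);
            maps[i] = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
        }
        return maps;
    }

    // Fetch the byte at a 64-bit offset by picking the right chunk.
    public static byte get(MappedByteBuffer[] maps, long index) {
        return maps[(int) (index / CHUNK)].get((int) (index % CHUNK));
    }
}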

Jeff Mc
  • I also hit the same limitation of `ByteBuffer`, which I think should be able to deal with long offsets and indexes, at least at the interface level; concrete implementations could check ranges explicitly. Unfortunately, it is not possible to map more than a 2 GB file into memory. – dma_k Nov 13 '13 at 14:42
  • Upvote as this is the right way to go, even if you have to partition the data into 2 GB chunks - wrap the chunks in a class which indexes with a long if you like. – simon.watts Dec 02 '14 at 11:09
  • MappedByteBuffer is also capped at 2 GB, practically useless. See http://nyeggen.com/post/2014-05-18-memory-mapping-%3E2gb-of-data-in-java/ for a solution which calls internal JNI methods to work around this. – jiping-s Jun 07 '16 at 16:18
2

Don't limit yourself to Integer.MAX_VALUE.

Although this question was asked many years ago, I wanted to contribute a simple example using only Java SE, without any external libraries.

At first glance it seems theoretically impossible, but it is practically possible.

A new way of looking at it: if an array is an object holding elements, what about an object that is an array of arrays?

Here's the example:

import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.List;

/**
*
* @author Anosa
*/
public class BigArray<T> {

    private static final int ARRAY_LENGTH = 1000000;

    public final long length;
    private final List<T[]> arrays;

    public BigArray(long length, Class<T> clazz)
    {
        this.length = length;
        arrays = new ArrayList<>();
        setupInnerArrays(clazz);
    }

    @SuppressWarnings("unchecked")
    private void setupInnerArrays(Class<T> clazz)
    {
        long numberOfArrays = length / ARRAY_LENGTH;
        long remainder = length % ARRAY_LENGTH;
        /*
            we could use Java 8 lambdas and streams instead:
            LongStream.range(0, numberOfArrays)
                      .forEach(i -> arrays.add((T[]) Array.newInstance(clazz, ARRAY_LENGTH)));
         */
        for (long i = 0; i < numberOfArrays; i++)
        {
            arrays.add((T[]) Array.newInstance(clazz, ARRAY_LENGTH));
        }
        if (remainder > 0)
        {
            // the remainder is guaranteed to be less than ARRAY_LENGTH (an int),
            // so this cast cannot lose information
            arrays.add((T[]) Array.newInstance(clazz, (int) remainder));
        }
    }

    public void put(T value, long index)
    {
        if (index >= length || index < 0)
        {
            throw new IndexOutOfBoundsException("out of range: index must be in [0, " + length + ")");
        }
        int indexOfArray = (int) (index / ARRAY_LENGTH);
        // use % here: indexOfArray * ARRAY_LENGTH would overflow int for large indexes
        int indexInArray = (int) (index % ARRAY_LENGTH);
        arrays.get(indexOfArray)[indexInArray] = value;
    }

    public T get(long index)
    {
        if (index >= length || index < 0)
        {
            throw new IndexOutOfBoundsException("out of range: index must be in [0, " + length + ")");
        }
        int indexOfArray = (int) (index / ARRAY_LENGTH);
        int indexInArray = (int) (index % ARRAY_LENGTH);
        return arrays.get(indexOfArray)[indexInArray];
    }
}

And here's the test:

public static void main(String[] args)
{
    long length = 60085147514L;
    BigArray<String> array = new BigArray<>(length, String.class);
    array.put("peace be upon you", 1);
    array.put("yes it works", 1755);
    String text = array.get(1755);
    System.out.println(text + "  (a string coming from the big array)");
}

This code is limited only by Long.MAX_VALUE and the Java heap, which you can raise as needed (I made it 3800 MB).

I hope this is useful and provides a simple answer.

Anas
2

I suggest you define some "block" objects, each of which holds (say) 1 GB in an array, then make an array of those.

pjc50
2

No, arrays are indexed by ints (except in some versions of JavaCard, which use shorts). You will need to slice the data up into smaller arrays, probably wrapping them in a type that gives you get(long), set(long, byte), etc. With sections of data that large, you might want to map the file using java.nio.

Tom Hawtin - tackline
1

I think the idea of memory-mapping the file (using the CPU's virtual memory hardware) is the right approach, except that MappedByteBuffer has the same 2 GB limitation as native arrays. This guy claims to have solved the problem with a pretty simple alternative to MappedByteBuffer:

http://nyeggen.com/post/2014-05-18-memory-mapping-%3E2gb-of-data-in-java/

https://gist.github.com/bnyeggen/c679a5ea6a68503ed19f#file-mmapper-java

Unfortunately, the JVM crashes when you read beyond 500 MB.

Tim Cooper
  • While in this specific example my use case was to read a file, this is not the only use case for large arrays. – Omry Yadan Jul 06 '16 at 17:21
1

Java doesn't presently support direct arrays with more than 2^31 − 1 elements.

Hopefully we'll see this feature in a future version of Java.

nicola
1

Java arrays use integers for their indices. As a result, the maximum array size is Integer.MAX_VALUE.

(Unfortunately, I can't find any proof from Sun themselves about this, but there are plenty of discussions on their forums about it already.)

I think the best solution in the meantime would be to make a 2D array, i.e.:

byte[][] data;
Dan Lew
1

As others have said, all Java arrays of all types are indexed by int, and so can have a maximum size of 2^31 − 1, or 2147483647 elements (~2 billion). This is specified by the Java Language Specification, so switching to another operating system or Java Virtual Machine won't help.

If you want to write a class to overcome this, as suggested above, you could use an array of arrays (for a lot of flexibility) or change element types (a long is 8 bytes, so a long[] can hold 8 times as much data as a byte[] of the same maximum length). A sketch of the second option follows.
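
As an illustration of that type-changing idea (my own sketch, not from the answer): pack 8 bytes into each long, so a single long[] can address roughly 16 GiB of byte data, enough for the 12 GB file in the question.

// Hypothetical sketch: pack 8 bytes per long so one long[] covers ~16 GiB.
public class PackedBytes {
    private final long[] words;
    private final long size;

    public PackedBytes(long size) {
        this.size = size;
        this.words = new long[(int) ((size + 7) / 8)]; // size must stay under ~16 GiB
    }

    public byte get(long index) {
        int shift = (int) (index % 8) * 8;
        return (byte) (words[(int) (index / 8)] >>> shift);
    }

    public void set(long index, byte b) {
        int word = (int) (index / 8);
        int shift = (int) (index % 8) * 8;
        // clear the target byte, then OR in the new value
        words[word] = (words[word] & ~(0xFFL << shift)) | ((b & 0xFFL) << shift);
    }

    public long size() {
        return size;
    }
}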

Nick Fortescue