19

I'm currently using an instance of RandomAccessFile to manage some in-memory data, but the size of my RandomAccessFile instance is beyond 2^64 bytes, so I cannot use methods such as seek() and write(), because they take a long and cannot manage an address space bigger than 2^64. So what do I do? Is there something else I can use which supports an address space beyond 2^64?

EDIT: Reason for asking this question:

I have a tree data structure which in theory can have up to 2^128 nodes, and I want to store this tree in a file. Each node holds roughly 6 bytes of data, so I'm wondering how I will store this tree to a file.

Ahmad
  • Wait, you're using a file to manage in-memory data, and you have more than 8 million terabytes of data to manage? – JB Nizet Aug 02 '17 at 20:56
  • It seems unlikely that your file has more bytes than the max size of a long. – DwB Aug 02 '17 at 20:58
  • Java does not even have a primitive type that is appropriate for recording offsets into a file as long as you describe. There might nevertheless be a third-party library that can handle such enormous files (I don't know), but software recommendations are off-topic here. As others have urged, do be sure that this is really a problem you face. The issue of files exceeding 4GB or even 2GB -- the size of a C `long` in many C implementations -- is real and important, but this is the first I've ever heard of a single file exceeding 9EB in size. – John Bollinger Aug 02 '17 at 21:09
  • It seems unlikely that humanity has ever created a single file larger than 2^64 bytes. Your 2^128 * 6 byte tree exceeds the total data storage capacity currently available on Earth by a factor of many billion times. – Boann Aug 02 '17 at 21:11
  • I added clarification above ... – Ahmad Aug 02 '17 at 21:11
  • It's a tree to store IPv6 ranges ... – Ahmad Aug 02 '17 at 21:21
  • If your algorithm needs more storage, design a new one. 3-4 years ago the combined storage need of Google+Facebook was less than that. See table with illustrative data sizes on https://en.m.wikipedia.org/wiki/Orders_of_magnitude_(data) – tevemadar Aug 02 '17 at 22:02

5 Answers

16

Not a proper answer, but are you sure your file is actually this large?

From the docs for Long.MAX_VALUE:

A constant holding the maximum value a long can have, 2^63-1.

From the docs for RandomAccessFile.length():

the length of this file, measured in bytes.

Do you know how many bytes 2^63-1 is? That's 9,223,372,036,854,775,807 bytes:

9,223,372,036,854,775,807 B
9,223,372,036,854,775    KB
9,223,372,036,854        MB
9,223,372,036            GB
9,223,372                TB
9,223                    PB
9                        EB

If I've done the math correctly, you would need a constant write speed of about 272 GB/s for a full year to produce a file of that size.
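
For reference, the rough arithmetic behind that estimate (assuming exactly one year of sustained writing):

9,223,372,036,854,775,807 B ÷ 31,536,000 s ≈ 2.9 × 10^11 B/s

which is roughly 270-290 GB/s depending on whether you count binary or decimal gigabytes.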

While this is an excellent question I would like to see an answer to, I highly doubt that you have a single file that will be 9EB in size, if the OS will even support this.

edit

Here are some file system limits, and much to my own surprise, NTFS will actually support single files up to 16 EiB; however, it is one of only a few file systems on the list that support files that large.


If you ABSOLUTELY need to access a file larger than 9 EiB, it looks like you might need to roll your own version of RandomAccessFile, using BigInteger where the original uses long. This could get you up to (2^32)^Integer.MAX_VALUE bytes.
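
To make the idea concrete, here is a minimal sketch (the class name, the 1 GiB segment size, and the assumption that a record never straddles a segment boundary are all mine, not anything from the JDK): split the BigInteger offset into a segment index and a local offset, and back each segment with an ordinary RandomAccessFile.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.math.BigInteger;

// Sketch only: a logical address space indexed by BigInteger, backed by many
// ordinary segment files that each stay comfortably within long range.
public class SegmentedRandomAccessFile {

    // Arbitrary segment size of 1 GiB; pick whatever suits the record layout.
    private static final BigInteger SEGMENT_SIZE = BigInteger.valueOf(1L << 30);

    private final File directory;

    public SegmentedRandomAccessFile(File directory) {
        this.directory = directory;
    }

    // Writes data at the given logical offset. Assumes a single record never
    // straddles a segment boundary (not handled here).
    public void write(BigInteger offset, byte[] data) throws IOException {
        BigInteger[] qr = offset.divideAndRemainder(SEGMENT_SIZE);
        File segmentFile = new File(directory, qr[0].toString(16)); // segment index as hex file name
        try (RandomAccessFile segment = new RandomAccessFile(segmentFile, "rw")) {
            segment.seek(qr[1].longValueExact()); // the local offset always fits in a long
            segment.write(data);
        }
    }

    // Reads buffer.length bytes starting at the given logical offset.
    public void read(BigInteger offset, byte[] buffer) throws IOException {
        BigInteger[] qr = offset.divideAndRemainder(SEGMENT_SIZE);
        File segmentFile = new File(directory, qr[0].toString(16));
        try (RandomAccessFile segment = new RandomAccessFile(segmentFile, "r")) {
            segment.seek(qr[1].longValueExact());
            segment.readFully(buffer);
        }
    }
}

A nice side effect of this layout is that only the segments you actually touch get created, so the logical address space can be astronomically large while the on-disk footprint stays proportional to the data you really store.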

Matt Clark
  • NTFS supports files up to 16 EB because they store the size as an unsigned 64-bit value, whereas Java's `long` is a *signed* 64-bit value. Still, even NTFS doesn't go beyond 64-bit sizes, because there's no storage device that could store such files. – Boann Aug 02 '17 at 21:06
  • Ignoring the obvious about memory and storage, @MattClark is right. Rolling your own using BigInteger is the only answer I see. – Dakoda Aug 07 '17 at 16:29
  • Matt, why did you add a bounty to a question you answered? Just curious – Carlos Bribiescas Aug 09 '17 at 22:43
  • Because as I stated in my _answer_, it is not really a proper answer to the question. As this is an unlikely question, it would be interesting to see a real, functioning answer to. – Matt Clark Aug 09 '17 at 22:45
  • IMO, it would be significantly over-engineered since there isn't hardware to use it... – Carlos Bribiescas Aug 09 '17 at 22:51
  • That's more storage than LMG has :D – eliaspr Aug 11 '17 at 19:01
  • Yeah, this is 18.4467441 exabytes which is a lot. – ACV Aug 11 '17 at 21:02
3

I suppose your question stems from this requirement: "Is there something else I can use which supports an address space beyond 2^64?" In other words, you want to access memory by address, and that address could be very big.

Of course, you should not allocate a 2^128 * 6 byte file; even if that were possible nowadays, it would be far too expensive. The typical approach is to split your storage into smaller parts and address it accordingly. For instance:

write(partition, address, node);
node = read(partition, address);

As you said, you want to store IPv6 addresses. To store IPv6 addresses and search over them quickly, it is enough to have a table with 8 columns and an index for each part of the IPv6 address. Or you can store the information in a tree hierarchy like:

  • 0000
    • 0000
      • 0000
        • etc
    • 0001
      • 0000
        • etc

which you would allocate on demand. So the real question is how to organize your storage effectively.
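
As a rough illustration of that on-demand hierarchy (the class name and structure are invented for this example, and it is kept in memory for simplicity; the same layout could be paged out to files), one can key a trie on the eight 16-bit groups of an address and only create child nodes when a group is actually inserted:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Sketch: a trie over the eight 16-bit groups of an IPv6 address.
// Nodes are created lazily, so only prefixes that actually occur use space.
public class Ipv6Trie {

    private static final class Node {
        final Map<Integer, Node> children = new HashMap<>();
        boolean terminal; // true if a full address ends here
    }

    private final Node root = new Node();

    // Assumes an IPv6 address (16 bytes from getAddress()).
    public void insert(InetAddress address) {
        byte[] bytes = address.getAddress();
        Node current = root;
        for (int i = 0; i < 8; i++) {
            int group = ((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF);
            current = current.children.computeIfAbsent(group, g -> new Node());
        }
        current.terminal = true;
    }

    public boolean contains(InetAddress address) {
        byte[] bytes = address.getAddress();
        Node current = root;
        for (int i = 0; i < 8; i++) {
            int group = ((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF);
            current = current.children.get(group);
            if (current == null) {
                return false;
            }
        }
        return current.terminal;
    }

    public static void main(String[] args) throws UnknownHostException {
        Ipv6Trie trie = new Ipv6Trie();
        trie.insert(InetAddress.getByName("2001:db8::1"));
        System.out.println(trie.contains(InetAddress.getByName("2001:db8::1"))); // true
        System.out.println(trie.contains(InetAddress.getByName("2001:db8::2"))); // false
    }
}

On disk, each node's child map could become a small index block of its own, so paths that are never inserted cost nothing.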

UPDATE

I want to note that in reality there is a private API in Java (Oracle JDK, not OpenJDK) which can give you an opportunity to handle files of more than 2 GB, but it is private and not part of the public API at all, so I won't describe it here without requests. You can find it directly in sun.nio.ch.FileChannelImpl (the private map0 and unmap0 methods).

egorlitvinenko
2

Even if you had the software to do such things, it would be unusable at the scale you suggest since there doesn't exist a single machine with that much disk space.

So, since the main issue is the hardware limitations of a single machine, the solution would be to use a distributed computing framework that will allow you to scale out as much as needed. I suggest using https://ignite.apache.org/ as it's incredibly flexible and has pretty decent support here on Stack Overflow.

Coming at this from another perspective, you want to store IPv6 addresses. At the theoretical level, sure, you will need 2^64 addresses. At the practical level, even if you attempted to index every IP out there today, you wouldn't significantly pass 2^32, since that is the number of IPv4 addresses and we are only just passing that limit.

Carlos Bribiescas
  • The hardware limitation certainly should not be RAM if implemented correctly. From the [Java Docs](https://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html) - `A random access file behaves like a large array of bytes stored in the file system.` When backed by the filesystem, you can open a file descriptor and seek to the desired location. You should not be loading the whole file into memory. – Matt Clark Aug 09 '17 at 23:02
  • I may have misunderstood, as I thought he would load it into memory. The same argument applies for hard disk space though, as I doubt he would (practically) be able to find a single machine with that much disk space. The practical solution is distributed computing. I won't update my answer though because I don't want to look up the actual possible max disk for a modern computer. :-) – Carlos Bribiescas Aug 09 '17 at 23:09
  • If you read my answer, [that also is there](https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits). My bounty was to find a more specific implementation, as my answer was just spelling out the semantics of the issue. :) – Matt Clark Aug 09 '17 at 23:11
  • Removing altogether the part about RAM. I think that my answer is an actual solution because it goes beyond the academic, theoretical problem and gives him a means/avenue to actually solve his problem. Thanks for pointing out I misread the question – Carlos Bribiescas Aug 09 '17 at 23:17
0

Yeah, this is 18.4467441 exabytes, which is a lot. You cannot store this in memory, as there is no computer or even cluster with that much memory (RAM).

Of course you can write to files. But these should definitely be multiple files; I don't think it is possible to have one file that large, and even if it were possible, it would take hours or days to seek through it. So there are 2 approaches:

  1. Split in multiple smaller files

  2. Use "streams" - read a bit, process, write, and read the next chunk (a minimal sketch of this is below).
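
Here is one way the streaming approach could look (the file names and the 64 KiB buffer size are just placeholders): read a chunk, process it, write it out, and move on, so only a small buffer is ever in memory.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch: push a huge file through a small fixed-size buffer instead of
// loading it into memory.
public class StreamProcessing {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // 64 KiB at a time
        try (InputStream in = new BufferedInputStream(new FileInputStream("nodes.in"));
             OutputStream out = new BufferedOutputStream(new FileOutputStream("nodes.out"))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                // process the chunk in place here, then write it back out
                out.write(buffer, 0, read);
            }
        }
    }
}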

ACV
  • As [already discussed](https://stackoverflow.com/questions/45470750/randomaccessfile-with-support-beyond-long/45470852#comment78163274_45601301) [on another answer](https://stackoverflow.com/posts/45601301/revisions), Java's [RandomAccessFile](https://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html) is already based on a file system file and has the ability to seek to a specific position, and stream; there is no need, or reason, to load it into memory. – Matt Clark Aug 13 '17 at 19:37
0

Maybe it is a silly observation, but did you think of serializing your data structure? There are many examples online; looking around, I found this simple example that you could adjust to your tree, and then you can do the conversion to store the data.
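
For what it's worth, a minimal sketch of that idea using standard Java serialization (the TreeNode class, its fields, and the file name are made up for illustration):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Sketch: write a whole tree to disk with ObjectOutputStream and read it back.
public class SerializeTree {

    static class TreeNode implements Serializable {
        private static final long serialVersionUID = 1L;
        long value;                                  // the node's ~6 bytes of data
        List<TreeNode> children = new ArrayList<>();
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        TreeNode root = new TreeNode();
        root.value = 42;

        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("tree.bin"))) {
            out.writeObject(root);                   // serialize the whole tree
        }

        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("tree.bin"))) {
            TreeNode restored = (TreeNode) in.readObject();
            System.out.println(restored.value);      // 42
        }
    }
}

Of course, this only works for a tree that actually fits in memory and on disk, which is the crux of the original question.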

developer_hatch