0

Edit: to those marking this as a duplicate, this has nothing (directly) to do with array limits. This has to do with the Apache POI libraries not currently being capable of handling larger files.

I've used apache POI to generate an XLSX file that is about 6GB. The file works perfectly fine when I open it in Excel. I'm now trying to add encryption when I generate the file, but I'm running into error with any file greater than 2GB (files less than 2GB encrypt fine and are also openable in Excel)

org.apache.poi.util.RecordFormatException: Can't allocate an array > 2147483647
    at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:545)
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.extend(ByteArrayBackedDataSource.java:85)
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.write(ByteArrayBackedDataSource.java:62)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.createBlockIfNeeded(POIFSFileSystem.java:453)
    at org.apache.poi.poifs.filesystem.POIFSStream$StreamBlockByteBuffer.createBlockIfNeeded(POIFSStream.java:223)
    at org.apache.poi.poifs.filesystem.POIFSStream$StreamBlockByteBuffer.write(POIFSStream.java:245)
    at org.apache.poi.poifs.filesystem.DocumentOutputStream.write(DocumentOutputStream.java:144)
    at org.apache.poi.util.IOUtils.copy(IOUtils.java:402)
    at org.apache.poi.poifs.crypt.ChunkedCipherOutputStream$EncryptedPackageWriter.processPOIFSWriterEvent(ChunkedCipherOutputStream.java:299)
    at org.apache.poi.poifs.filesystem.POIFSDocument.<init>(POIFSDocument.java:117)
    at org.apache.poi.poifs.filesystem.DirectoryNode.createDocument(DirectoryNode.java:373)
    at org.apache.poi.poifs.crypt.ChunkedCipherOutputStream.close(ChunkedCipherOutputStream.java:250)
    .....

The code I'm using to encrypt the file is:

    public static void encryptFile(File unencryptedFile, File outputFile, String password) throws IOException, InvalidFormatException, GeneralSecurityException
    {
        POIFSFileSystem fs = new POIFSFileSystem();
        EncryptionInfo info = new EncryptionInfo(EncryptionMode.agile);

        Encryptor enc = info.getEncryptor();
        enc.confirmPassword(password);

        try(OPCPackage opc = OPCPackage.open(unencryptedFile))
        {
            try(OutputStream os = enc.getDataStream(fs))
            {
                opc.save(os);
            } <-----EXCEPTION IS THROWN HERE BY THE CALL TO close()
        } 


        try(FileOutputStream fos = new FileOutputStream(outputFile))
        {
            fs.writeFilesystem(fos);
        }
    }

The exception is getting throw by the close() when this try block exits (as noted in the code above). Does anyone know another method I can use to encrypt the file that isn't limited to 2GB?

ldkronos
  • 161
  • 1
  • 4
  • Arrays are indexed by integers. The max integer in Java is `2,147,483,647`. That is why you are getting the error. Under the hood they are using a byte array. So, that library appears to support files of max size `2.147483647GB`. Maybe try using an external library that is just for encrypting large files. Hopefully a library that lets you read / encrypt the bytes as a stream rather so u can read it in chunks and note blow out RAM? –  Apr 12 '19 at 22:01
  • 2
    POI also has FileBackedDataSource. Your code is using the ByteArrayBackedDataSource. The file based version should support larger data sets. – PJ Fanning Apr 12 '19 at 22:24
  • 1
    new POIFSFileSystem(file) -- should use FileBackedDataSource based on the file you provide – PJ Fanning Apr 12 '19 at 22:35
  • @PJFanning - Thanks, thats a start to begin looking, but it doesn't seem quite that simple. It seems like the File parameter is only useful when reading an existing file. By default it opens it read only. I tried calling POIFSFileSystem(File, boolean) with the second param to false to indicate it's NOT read only. When I do that, it complains about my file is zero byte. I tried using the FileChannel option instead and it seems it's expecting the file to already have valid office headers (ie: reading an existing file). So I'm not sure how to get it to use a FileBackedDataSource for writing – ldkronos Apr 13 '19 at 04:30
  • 1
    I logged a https://bz.apache.org/bugzilla/show_bug.cgi?id=63344 and will add an experimental fix. This will require you to download the latest POI source and build the POI jars locally. – PJ Fanning Apr 13 '19 at 09:57
  • https://github.com/apache/poi/commit/e8b5f8b3af7532e09e60c258afbb706d0df4b793 - TempFilePOIFSFileSystem can be used in place of POIFSFileSystem - main POI build is ant based (eg `ant jar` will build the jars) – PJ Fanning Apr 13 '19 at 10:07
  • If you create the POIFSFileSystem with the large block size rather than the (default) small block size, does that work better? The large block size allows for much larger OLE2 files – Gagravarr Apr 13 '19 at 15:50
  • @PJFanning - No luck. I downloaded the snapshot and built it just fine. The temp file works fine for small files, but it seems in the code there is still some stuff using int instead of longs for indexs. – ldkronos Apr 15 '19 at 14:55
  • Exception in thread "main" java.lang.IllegalArgumentException: Negative position at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:825) at org.apache.poi.poifs.nio.FileBackedDataSource.write(FileBackedDataSource.java:120) at org.apache.poi.poifs.filesystem.POIFSFileSystem.createBAT(POIFSFileSystem.java:411) at org.apache.poi.poifs.filesystem.POIFSFileSystem.getFreeBlock(POIFSFileSystem.java:512) at org.apache.poi.poifs.filesystem.POIFSStream$StreamBlockByteBuffer.createBlockIfNeeded(POIFSStream.java:201) at – ldkronos Apr 15 '19 at 14:57
  • I don't have any 2Gb xlsx files hanging around to test with, so this is kind of trial and error. – PJ Fanning Apr 15 '19 at 15:26
  • could you add logging in org.apache.poi.poifs.filesystem.POIFSFileSystem.createBAT to see if its `int writeTo` is the issue? – PJ Fanning Apr 15 '19 at 15:31
  • @PJFanning I just created a random xlsx file with this: https://pastebin.com/fkkHcuUw – ldkronos Apr 15 '19 at 16:45
  • I generated a spreadsheet using the pastebin code and a fix that I added to POI for the int overflow - the encryption runs but I'm suspicious about the output. The file was 2147483647 bytes which suggests that the file was truncated at Integer.MAX_VALUE. – PJ Fanning Apr 15 '19 at 17:31
  • @PJFanning - Yes, I originally tried to generate the unencrypted file once, and then keep reusing it for quicker testing. But for some odd reason. The unencrypted file (which I would not expect to be touched during the encryption) also gets truncated to exactly 2GB (was 2.15GB before). Now I just save a 2nd copy of the file and manually the original unencrypted.xlsx before each run. That seems like a separate bug to me. And yes, I tried converting the int to a long (and the "1 +" to a "1L +" to ensure type) and it runs without error but results in a corrupt file of exactly 2GB. – ldkronos Apr 15 '19 at 17:58
  • under the hood, the POIFS classes use int to represent the size param and I'm afraid that a lot of code will need to change to make this a long - I'm not sure whether this would also lead to other issues -- my suggestion is that if you have an immediate requirement to support large xlsx files, that you encrypt the files as opposed to password protecting them - this would mean the consumer would need to decrypt the file before opening it in Excel – PJ Fanning Apr 15 '19 at 21:43
  • @PJFanning - Unfortunately wrapper encryption won't work as we need stats software to be able to natively load the file without leaving it decrypted. Thanks for your attempts at helping with this. I'll keep an eye on the bug report and hopefully someone can figure out a fix at some point. Until then we'll have to work around it by splitting the dataset across files (which is even uglier than the wrapper option, but at least it fits the requirements we're force to work within) – ldkronos Apr 18 '19 at 00:05

0 Answers0