3

There seems to be a problem with getting deterministic hash values for the POI XLSX format, with MessageDigest SHA-256 implementation, even for empty ByteArray streams. This happens randomly, after several hundreds or even only thousands of iterations.

The relevant code snippets used to reproduce the problem:

// TestNG FileTest:
@Test(enabled = true) // indeterminism at random iterations, such as 400 or 1290
public void emptyXLSXTest() throws IOException, NoSuchAlgorithmException {
    final Hasher hasher = new HasherImpl();
    boolean differentSHA256Hash = false;
    for (int i = 0; i < 10000; i++) {
        final ByteArrayOutputStream excelAdHoc1 = BusinessPlanInMemory.getEmptyExcel("xlsx");
        final ByteArrayOutputStream excelAdHoc2 = BusinessPlanInMemory.getEmptyExcel("xlsx");

        byte[] expectedByteArray = excelAdHoc1.toByteArray();
String expectedSha256 = hasher.sha256(expectedByteArray);
byte[] actualByteArray = excelAdHoc2.toByteArray();
String actualSha256 = hasher.sha256(actualByteArray);

if (!expectedSha256.equals(actualSha256)) {
            differentSHA256Hash = true;
            System.out.println("ITERATION: " + i);
            System.out.println("EXPECTED HASH: " + expectedSha256);
            System.out.println("ACTUAL HASH: " + actualSha256);
            break;
        }
    }
    Assert.assertTrue(differentSHA256Hash, "Indeterminism did not occur");
}

Referenced Hasher and POI code:

// HasherImpl class:
public String sha256(final InputStream stream) throws IOException, NoSuchAlgorithmException {
    final MessageDigest digest = MessageDigest.getInstance("SHA-256");
    final byte[] bytesBuffer = new byte[300000]; 
    int bytesRead = -1;
    while ((bytesRead = stream.read(bytesBuffer)) != -1) {
        digest.update(bytesBuffer, 0, bytesRead);
    }
    final byte[] hashedBytes = digest.digest();
    return bytesToHex(hashedBytes);
}

Tried to eliminate indeterminism due to meta data like creation time, to no avail:

// POI BusinessPlanInMemory helper class:
public static ByteArrayOutputStream getEmptyExcel(final String fileextension) throws IOException {
    Workbook wb;

    if (fileextension.equals("xls")) {
        wb = new HSSFWorkbook();
    }
    else {
        wb = new XSSFWorkbook();
        final POIXMLProperties props = ((XSSFWorkbook) wb).getProperties();
        final POIXMLProperties.CoreProperties coreProp = props.getCoreProperties();
        coreProp.setCreated("");
        coreProp.setIdentifier("1");
        coreProp.setModified("");
    }

    wb.createSheet();

    final ByteArrayOutputStream excelStream = new ByteArrayOutputStream();
    wb.write(excelStream);
    wb.close();
    return excelStream;
}

The HSSF / XLS format seems not to be affected by the problem described. Does anybody have a clue, what could be causing this, if not a bug in POI itself? Basically, the code above refers to https://poi.apache.org/spreadsheet/examples.htmlBusinessPlan example

Thanks for your input!

fozzybear
  • 91
  • 9

3 Answers3

2

This is not a definitive answer but this is my suspicion what happens:

docx and xlsx file formats are basically a bunch of zipped-up xml-files. This can easily be seen when renaming them to .zip and opening with your favorite zip-tool.

When examining a file created by word I noticed that the change-timestamp of all files contained in the archive is always 1980-01-01 00:00:00 while in those created with POI it will show the actual timestamp the file was created.

So my I suspect that your problem occurs when there is a timestamp-difference between one or more of the files in excelAdHoc1 and excelAdHoc2. This might happen when the clock switches to the next second while creating one or the other file.

This would not affect XLS-files since the HSSF-format is not of the "zipped xml"-type and thus does not contain any nested files that might have different timestamps.

To change the timestamps after writing the file you could try using the `java.util.zip``-package. I haven't tested it but this should do the trick:

ZipFile file = new ZipFile(pathToFile);
Enumeration<ZipEntry> e = file.entries();
while(e.hasMoreElements()) {
    ZipEntry entry = e.nextElement();
    entry.setTime(0L);
}
piet.t
  • 11,718
  • 21
  • 43
  • 52
  • Thanks for your thoughts on this. I'll have to test and write the actual files to doublecheck it. But shouldn't setting the CoreProperty meta data (creation and modifying time, as done above) prevent a situation like this? Or does it only affect internal meta data, not those of the archive? – fozzybear Aug 01 '16 at 13:55
  • I think the best way to check this would be as you said: write the file and check the zip-contents. In my existing files I had not modified the CoreProterties, so I can't tell if that's what is causing the difference in my case. – piet.t Aug 01 '16 at 14:07
  • Looks like you've put me on the right track, piet! I've extracted the generated expected and actual contents, and everything was similar, files, folders, CRC's, but the modification times differed by 2 secs. Given the fact, that I've explicitly told POI to clear the modification times, this is odd. Unless, this affects other, internal modification times. Now I just need to figure out, how to manipulate the modification times of the files inside the XLSX, prior or after creation. Otherwise, I see no other way, but to un-zip, touch and re-zip the files. What do you think? – fozzybear Aug 01 '16 at 14:36
  • @fozzybear Your code tells POI to change the document-level metadata, which is what gets shown in Office if you check the document properties. It doesn't affect the zip-level metadata, which is automatically set by the low-level zip library – Gagravarr Aug 01 '16 at 16:55
  • @Gagravarr Yes, I've figured it out, but need to find a way to change the zip content's properties. Think I've found a possible solution here [link](http://stackoverflow.com/questions/15667125/read-content-from-files-which-are-inside-zip-file) – fozzybear Aug 01 '16 at 17:37
  • @fozzybear I added some code that should change the timestamp in the zip-file. it is untested, so feel free to correct all errors you find ;) – piet.t Aug 02 '16 at 05:52
  • @piet.t I've tried your ZipFile code snippet, but it seemed not to be possible to alter zip-internal file attributes. Extracting and looking at them in cygwin terminal with `ls -la --full-time` reveals, that neither the file's lastmodified attributes, nor the directorie's attributes are affected (which seem not be be iterated with the Enumeration anyway). Or is this only inteded for reading a zip file and neccessary to write a new ZipFile from the modified ZipFile entries - or using a ZipOutputStream in the first place? – fozzybear Aug 03 '16 at 15:22
1

The issue is indeed the timestamps on the ZipEntrys of the ZipFile.

The following code resets the timestamps of an xlsx archive, without any write to disk (i.e. no usage of File API).

private byte[] resetZipTimestamps(byte[] bytes) {
try {
  ZipInputStream zipInputStream = new ZipInputStream(
    new ByteArrayInputStream(bytes)
  );
  ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
  ZipOutputStream zipOutputStream = new ZipOutputStream(
    byteArrayOutputStream
  );
  ZipEntry zipEntry;
  while ((zipEntry = zipInputStream.getNextEntry()) != null) {
    Instant instant = ZonedDateTime
      .of(2023, 1, 1, 0, 0, 0, 0, ZoneId.systemDefault())
      .toInstant();
    zipEntry.setLastModifiedTime(FileTime.from(instant));
    zipEntry.setCreationTime(FileTime.from(instant));
    zipOutputStream.putNextEntry(zipEntry);
    byte[] buffer = new byte[256];
    int readLength;
    while ((readLength = zipInputStream.read(buffer, 0, 256)) > 0) {
      zipOutputStream.write(buffer, 0, readLength);
    }
  }
  zipOutputStream.close();
  return byteArrayOutputStream.toByteArray();
} catch (IOException e) {
  throw new IllegalStateException(e);
}

}

Tested successfully on Java 17 with an XLSX generated with Apache POI 5.2.3. md5sum is now deterministic, so checksums can be used in regression testing of Excel file generation.

zipinfo before:

$ zipinfo report_before.xlsx 
Archive:  report_before.xlsx
Zip file size: 66564 bytes, number of entries: 15
-rw----     2.0 fat     2197 bl defN 23-Mar-02 17:01 [Content_Types].xml
-rw----     2.0 fat      592 bl defN 23-Mar-02 17:01 _rels/.rels
-rw----     2.0 fat      383 bl defN 23-Mar-02 17:01 docProps/app.xml
-rw----     2.0 fat      540 bl defN 23-Mar-02 17:01 docProps/core.xml
-rw----     2.0 fat    17076 bl defN 23-Mar-02 17:01 xl/sharedStrings.xml
-rw----     2.0 fat     5837 bl defN 23-Mar-02 17:01 xl/styles.xml
-rw----     2.0 fat     1236 bl defN 23-Mar-02 17:01 xl/workbook.xml
-rw----     2.0 fat     1409 bl defN 23-Mar-02 17:01 xl/_rels/workbook.xml.rels
-rw----     2.0 fat     1988 bl defN 23-Mar-02 17:01 xl/worksheets/sheet1.xml
-rw----     2.0 fat     2756 bl defN 23-Mar-02 17:01 xl/worksheets/sheet2.xml
-rw----     2.0 fat     1874 bl defN 23-Mar-02 17:01 xl/worksheets/sheet3.xml
-rw----     2.0 fat     3203 bl defN 23-Mar-02 17:01 xl/worksheets/sheet4.xml
-rw----     2.0 fat     8227 bl defN 23-Mar-02 17:01 xl/worksheets/sheet5.xml
-rw----     2.0 fat    27795 bl defN 23-Mar-02 17:01 xl/worksheets/sheet6.xml
-rw----     2.0 fat   370197 bl defN 23-Mar-02 17:01 xl/worksheets/sheet7.xml
15 files, 445310 bytes uncompressed, 64552 bytes compressed:  85.5%

zipinfo after:

$ zipinfo report.xlsx 
Archive:  report.xlsx
Zip file size: 66894 bytes, number of entries: 15
-rw----     2.0 fat     2197 bX defN 23-Jan-01 00:00 [Content_Types].xml
-rw----     2.0 fat      592 bX defN 23-Jan-01 00:00 _rels/.rels
-rw----     2.0 fat      383 bX defN 23-Jan-01 00:00 docProps/app.xml
-rw----     2.0 fat      540 bX defN 23-Jan-01 00:00 docProps/core.xml
-rw----     2.0 fat    17076 bX defN 23-Jan-01 00:00 xl/sharedStrings.xml
-rw----     2.0 fat     5837 bX defN 23-Jan-01 00:00 xl/styles.xml
-rw----     2.0 fat     1236 bX defN 23-Jan-01 00:00 xl/workbook.xml
-rw----     2.0 fat     1409 bX defN 23-Jan-01 00:00 xl/_rels/workbook.xml.rels
-rw----     2.0 fat     1988 bX defN 23-Jan-01 00:00 xl/worksheets/sheet1.xml
-rw----     2.0 fat     2756 bX defN 23-Jan-01 00:00 xl/worksheets/sheet2.xml
-rw----     2.0 fat     1874 bX defN 23-Jan-01 00:00 xl/worksheets/sheet3.xml
-rw----     2.0 fat     3203 bX defN 23-Jan-01 00:00 xl/worksheets/sheet4.xml
-rw----     2.0 fat     8227 bX defN 23-Jan-01 00:00 xl/worksheets/sheet5.xml
-rw----     2.0 fat    27795 bX defN 23-Jan-01 00:00 xl/worksheets/sheet6.xml
-rw----     2.0 fat   370197 bX defN 23-Jan-01 00:00 xl/worksheets/sheet7.xml
15 files, 445310 bytes uncompressed, 64552 bytes compressed:  85.5%
rcomblen
  • 4,579
  • 1
  • 27
  • 32
0

Ok, I've found a way to reset all XSLX file entry file time attributes, according to some example found here at SO. Unfortunately, only file entries seem to be accessible by methods like via ZipFile or OPCPackage. I could not find a solution to also access and reset the folders inside the archive, which also have differing time attributes.

So far, I didn't succeed in eliminating the differing properties of the POI-generated XLSX-archives, to get same SHA256 hashes out of two otherwise identical files, for which the differing attributes seem to be the reason.

private void resetOPCPTimeAttributes(File file)
        throws InvalidFormatException, IOException, OpenXML4JException, XmlException {

    OPCPackage opcp = ZipPackage.open(file);
    resetZipfileContentTimeAttributes(opcp.getParts());

    opcp.flush();
    opcp.close();
}

private void resetZipfileContentTimeAttributes(List<PackagePart> parts) throws InvalidFormatException {

    ArrayList<PackagePart> subParts = null;
    for (PackagePart part: parts) {

        PackageProperties props = part.getPackage().getPackageProperties();
        props.setLastModifiedByProperty("");
        props.setCreatedProperty("");
        props.setModifiedProperty("");

        subParts = part.getPackage().getParts();

        while (subParts != null) {
            resetZipfileContentTimeAttributes(subParts);
        }
    }
}

Edit:

In the meantime (until I or someone else finds a solution for manipulating folder meta data inside the Zip archive), I've switched to the deep comparing solution here: Comparing XLSX files

fozzybear
  • 91
  • 9