In Java (JDK 1.6), is there a way to check that a file is a valid xlsx without opening the entire file with POI or another API? Currently we use Apache POI in the project to open the file: we create a `new XSSFWorkbook(inputStream)`, and if that throws an exception, the file is not a valid xlsx. However, we found one 8 MB xlsx file that takes 1 GB of memory to open for some reason, and it actually caused a production outage on our servers. We cannot rely on the file extension, since someone can take a non-xlsx file, such as a PHP file, and rename it with an xlsx extension. I'm looking for an option with minimal memory impact, ideally one that doesn't open the file at all.

It's too much of a risk if a single file upload can kill the server, but we still need to validate that the file is in fact an xlsx.

George
  • Avoid tools that use the new POI 4.0.0 release, as it requires Java 8 or later. Have a look at https://github.com/monitorjbl/excel-streaming-reader - it supports POI 3.17 and can therefore be used with Java 6 – PJ Fanning Oct 01 '18 at 23:12
  • Define "valid xlsx" - depending on your requirements here (from "looks like something in a zip" to "opens without error in Excel") the solution will vary! – Gagravarr Oct 01 '18 at 23:18
  • And whether you also consider xlsm, xlsb, xls (and older formats) valid Excel workbooks. – IceArdor Oct 02 '18 at 04:42

1 Answer

If you don't know what your file is at all, use Apache Tika to do the detection - it can detect a huge number of different file formats for you.

Determine MS Excel file type with Apache POI

Here are some examples: https://www.baeldung.com/apache-tika

John
  • Specifically, use Tika content detection https://tika.apache.org/1.19/detection.html using tika-mimetypes.xml (or build your own xml for just the formats you care about) https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L2350 – IceArdor Oct 02 '18 at 04:47
  • @IceArdor You can't tell the difference between a `.zip` and a `.xlsx` or a `.odp` (for example) with mime magic alone, you'd need Apache Tika's zip-aware detector for that (available as standard via `DefaultDetector`) – Gagravarr Oct 02 '18 at 10:21
  • @Gagravarr absolutely right! https://tika.apache.org/1.19/api/org/apache/tika/detect/DefaultDetector.html and https://tika.apache.org/1.19/detection.html#Container_Aware_Detection – IceArdor Oct 04 '18 at 05:47
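
To illustrate what "zip-aware" detection means in the comments above, here is a minimal hand-rolled sketch that uses only `java.util.zip` and Java 6 syntax. It is not Tika's implementation and not a full validity check: it only streams through the zip entries, looks for `[Content_Types].xml`, and checks whether that manifest declares the xlsx workbook content type. The class name `XlsxSniffer`, the method `looksLikeXlsx`, and the 1 MB manifest cap are assumptions made up for this example. It also assumes the manifest is UTF-8 (or ASCII-compatible) encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of POI or Tika.
public class XlsxSniffer {

    // Content type that the workbook part of an .xlsx declares in [Content_Types].xml.
    private static final String XLSX_MAIN =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml";

    // Assumed safety cap: never buffer more than this much of the manifest.
    private static final int MAX_MANIFEST_BYTES = 1024 * 1024;

    /** Returns true if the stream is a zip whose manifest declares an xlsx workbook. */
    public static boolean looksLikeXlsx(InputStream in) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        try {
            ZipEntry entry;
            // getNextEntry() returns null for non-zip input and at the end of the archive.
            while ((entry = zip.getNextEntry()) != null) {
                if ("[Content_Types].xml".equals(entry.getName())) {
                    return readCapped(zip, MAX_MANIFEST_BYTES).contains(XLSX_MAIN);
                }
            }
            return false;
        } catch (ZipException e) {
            // Corrupt zip structure: treat as "not an xlsx".
            return false;
        } finally {
            zip.close();
        }
    }

    // Reads at most max bytes of the current entry, so memory use stays bounded.
    private static String readCapped(InputStream in, int max) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while (out.size() < max
                && (n = in.read(buf, 0, Math.min(buf.length, max - out.size()))) != -1) {
            out.write(buf, 0, n);
        }
        return out.toString("UTF-8");
    }
}
```

You would call it as `XlsxSniffer.looksLikeXlsx(new FileInputStream(uploadedFile))`. Only the manifest is ever buffered, so memory stays small regardless of sheet size. Entries that precede the manifest are still decompressed and discarded while skipping, so a hostile archive can cost CPU time; capping the number of entries scanned would be a reasonable addition. A file that passes this check can still fail to open in POI or Excel, so it answers "is this plausibly an xlsx?" rather than "is this a valid workbook?".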