4

I searched how to detect that file is .xls and I've found a solution like this (but not deprecated):
POIFSFileSystem:

@Deprecated
@Removal(version="4.0")
public static boolean hasPOIFSHeader(InputStream inp) throws IOException {
    return FileMagic.valueOf(inp) == FileMagic.OLE2;
}

But this one returns true for all microsoft word documents for example for .doc

Is there a way to detect .xls document?

Aleksei Budiak
  • 871
  • 5
  • 16
gstackoverflow
  • 36,709
  • 117
  • 359
  • 710
  • Have a look at the source of [EmbeddedExtractor](https://poi.apache.org/apidocs/org/apache/poi/ss/extractor/EmbeddedExtractor.html) – kiwiwings Sep 22 '17 at 20:14
  • 1
    @kiwiwings You seem experienced with Apache POI. Wouldn't it be enough to check something like `POIFSFileSystem(inputStream).getRoot().getStorageClsid().equals(ClassID.EXCEL97)`? Since I don't know the file format well I couldn't be sure that it is reliable or not. – Kul-Tigin Sep 26 '17 at 06:43
  • 2
    @Kul-Tigin Beside the `ClassID`, I would also check for `DirectoryNode` names - see the `BiffExtractor` in [EmbeddedExtractor](https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/java/org/apache/poi/ss/extractor/EmbeddedExtractor.java). Not every program, which outputs .xls/.doc, sets the ClassId correct. Be aware that there are multiple `DirectoryNode` names for e.g. Excel (see [`InternalWorkbook`](https://svn.apache.org/repos/asf/poi/trunk/src/java/org/apache/poi/hssf/model/InternalWorkbook.java)`.WORKBOOK_DIR_ENTRY_NAMES`) – kiwiwings Sep 26 '17 at 16:28
  • Is it an option to use Apache Tika? https://tika.apache.org/ – Nathanael Sep 27 '17 at 12:29
  • @Nathanael Tika usesPOI internally – gstackoverflow Sep 27 '17 at 13:33
  • 1
    Yes, if you could use Tika you can use the MimeType detection. Or you can look in the source code how they implemented it. E.g. the POIFSContainerDetector – Nathanael Sep 28 '17 at 12:30
  • @Nathanael, I can use Tika. – gstackoverflow Sep 28 '17 at 12:33
  • The `file` command (cygwin) seems to differentiate these. You use it via ProcessBuilder or use its code to do file magic check. – Jayan Oct 01 '17 at 08:39

3 Answers3

3

Both .doc/.xls documents can are stored in the OLE2 storage format. The org.apache.poi.poifs.filesystem.FileMagic helps you to detect the file storage format only and not sufficient alone to distinguish between .doc/.xls files.

Also it does not appear that there is any direct API available in POI library to determine the document type (excel or document) for given inputstream/file.

Below example my be helpful to determine if given stream is a valid .xls (or .xlsx)file with the caveat that it read the given inputstram and close it.

    // slurp content from given input and close it
    public static boolean isExcelFile(InputStream in) throws IOException {
        try {
            // it slurp the input stream
            Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
            workbook.close();
            return true;

        } catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
            return false;
        }
    }

You may found more information on excel file format on this link

Update Solution based on Apache Tika as suggested by gagravarr:

public class TikaBasedFileTypeDetector {
    private Tika tika;
    private TemporaryResources temporaryResources;

    public void init() {
        this.tika = new Tika();
        this.temporaryResources = new TemporaryResources();
    }

    // clean up all the temporary resources
    public void destroy() throws IOException {
        temporaryResources.close();
    }

    // return content mime type
    public String detectType(InputStream in) throws IOException {
        TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);

        return tika.detect(tikaInputStream);
    }

    public boolean isExcelFile(InputStream in) throws IOException{
        // see https://stackoverflow.com/a/4212908/1700467 for information on mimetypes
        String type = detectType(in);
        return type.startsWith("application/vnd.ms-excel") || //for Micorsoft document
                type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // for OpenOffice xml format
    }
}

See this answer on mime types.

skadya
  • 4,330
  • 19
  • 27
  • 1
    You need to give Tika the whole file via a `TikaInputStream` if you want it to do Container-aware detection to get the OLE2 subtype, to work out if it's DOC or XLS, you can't get that from Mime Magic alone – Gagravarr Oct 01 '17 at 02:34
  • try catch blocks should never be used as an if statement – JSON Oct 21 '20 at 17:49
2

You can work with Apache POI's - HSSF module.
That model (library) is written to read and write xls files (and latest for xlsx as well - although these are different languages).
With this code...

InputStream ExcelFileToRead = new FileInputStream("FileNameWithLink.xls");
HSSFWorkbook wb = new HSSFWorkbook(ExcelFileToRead);
HSSFSheet sheet = wb.getSheetAt(0);

...you can detect if it is readable xls file.
Going deeper you can use this code to try reading it etc. Actually that module is really easy to use.
There can be situations that it technically is .xls file, but it may not be readable (there can be various problems with it).
Extra - XSSF is for .xlsx and HSSF is for .xls.

I haven't used other techniques as I always want to be sure that I will be able read that file later.

Szymon Stepniak
  • 40,216
  • 10
  • 104
  • 131
Mike B
  • 2,756
  • 2
  • 16
  • 28
  • I noticed that this lib throws different exceptions for .doc file, for encrypted xls and for pdf. I want to distinquisg these cases – gstackoverflow Sep 28 '17 at 09:44
  • could you explain for my tired brain a bit more? distinguish more? aren't those different exceptions distinguishing those cases enough? – Mike B Sep 28 '17 at 10:16
  • I just noticed for these types. I am not sure that for some another type I will see same exception – gstackoverflow Sep 28 '17 at 11:03
2

You can use docx4j. Load the file with OpcPackage.load() and then check the content type.

OpcPackage.load()

 * Convenience method to create a WordprocessingMLPackage
 * or PresentationMLPackage
 * from an inputstream (.docx/.docxm, .ppxtx or Flat OPC .xml).
 * It detects the convenient format inspecting two first bytes of stream (magic bytes). 
 * For office 2007 'x' formats, these two bytes are 'PK' (same as zip file)  

load() returns a OpcPackage which is the abstract class that GloxPackage, PresentationMLPackage, SpreadsheetMLPackage, WordprocessingMLPackage are based on. So this should work for word, excel and powerpoint docs.

A basic check

public final String XLSX_FILE = "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml";
public final String WORD_FILE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
public final String UNKNOWN_FILE = "UNKNOWN";



public boolean isFileXLSX(String fileLocation) {
    return getContentTypeFromFile(fileLocation).equals(XLSX_FILE);
}


public String getContentTypeFromFile(String fileLocation) {
    try {
        return OpcPackage.load(new File(fileLocation)).getContentType();
    } catch (Docx4JException e) {
        return UNKNOWN_FILE;
    }
}

You should see values like

application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml
denov
  • 11,180
  • 2
  • 27
  • 43
  • Is it unversal solution for all file types? – gstackoverflow Sep 29 '17 at 08:49
  • i added some more details. this should work for word, excel and powerpoint - any format that docx4j supports. I have only tested that code above with a .docx and .xlsx files. – denov Sep 29 '17 at 17:53
  • Actually I want to have the method which accept file and return true if it is excel and false in other cases. It can be .doc, .pdf, ..exe and so on – gstackoverflow Sep 30 '17 at 07:10
  • hmm.. so after adding some more code i realize you wrote '.doc / .xls' - the old office format. i'm pretty sure this method only works on the 'x' format - the xml based formats. – denov Sep 30 '17 at 22:01