Is there any way to interrogate a file to see if it is an Excel document without looking at the extension?

I have a situation where I need to convert a number of files to PDF (from Excel and Word). In some instances, I am finding files that have been saved with a .doc extension but are really Excel files. I have renamed a few to .xls and they launch just fine in Excel. They show up in the Finder as Word 95 docs, I guess because they are binary and have a .doc extension.

Is there some standard header or text in Excel files that I can string-search for (in PowerShell) to identify misnamed files?

Mike
  • You can use some third-party tool. See, for example, https://stackoverflow.com/questions/32460177/how-can-i-find-out-a-files-mime-typecontent-type-on-windows – František Žiačik Jan 02 '18 at 22:02
  • @FrantišekŽiačik As it turns out, there was no quick and dirty way to do this. Most utilities rely on the file extension to determine the MIME type. I needed to interrogate the file itself, as these are misnamed. I ended up using Cygwin's file utility, which correctly identifies the MIME type. Please post your comment as an answer and I will accept it. – Mike Jan 03 '18 at 21:08

3 Answers

If you have recent versions of Excel and Word files, they are really just .zip files. You can use your favorite PKZip reader and try to open them. If there is an xl folder or a word folder inside, that indicates the content type. You can easily check other Office file types by renaming them to .zip and opening them; just don't forget to rename them back. You can usually tell a file is a .zip if its first two characters are PK.
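The folder check above can be done without renaming anything. Here is a minimal sketch in Python (used for illustration, since the byte-level logic is language-independent). The function name `guess_ooxml_kind` is made up for this example. It assumes that a top-level `xl/` or `word/` entry is enough to tell the two apart.

```python
import zipfile


def guess_ooxml_kind(path):
    """Return 'excel', 'word', or None based on folders inside the ZIP.

    Hypothetical helper: it only looks at entry names, not at the
    extension of the file on disk.
    """
    if not zipfile.is_zipfile(path):
        return None  # not a ZIP container, so not a zipped Office file
    with zipfile.ZipFile(path) as zf:
        names = [name.lower() for name in zf.namelist()]
    if any(name.startswith("xl/") for name in names):
        return "excel"
    if any(name.startswith("word/") for name in names):
        return "word"
    return None  # a ZIP, but neither folder is present
```

Calling `guess_ooxml_kind` on a misnamed file ignores the extension entirely. Legacy binary files are not ZIP archives, so they return `None` here and need a different check.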

No Refunds No Returns

According to Wikipedia, the magic number for legacy pre-2007 Office documents (doc, xls, ppt, msg) is D0 CF 11 E0 A1 B1 1A E1. This number will be at the start of the file. There is also an article on the file format itself that includes the header struct.

This format has also been used for some other files in Windows, so be careful when making assumptions.
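A rough sketch of the magic-number check, in Python for illustration. The magic bytes come from the answer above; everything after that check is an added assumption. Compound files store stream names as UTF-16LE text in their directory, Word documents normally contain a stream named `WordDocument`, and Excel 97 and later workbooks contain one named `Workbook`. Searching the raw bytes for those names is only a heuristic, not a proper parse of the container, and the function name `guess_legacy_kind` is hypothetical.

```python
# Signature of the legacy compound-file container, as quoted above.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def _utf16(name):
    # Directory entry names inside the container are UTF-16LE encoded.
    return name.encode("utf-16-le")


def guess_legacy_kind(path):
    """Return 'word', 'excel', 'unknown-ole', or None (not a compound file).

    Heuristic only: a document with an embedded object of the other
    type may contain both names. Excel 5/95 files name the stream
    'Book' instead, which this sketch does not check.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(OLE_MAGIC):
        return None
    if _utf16("WordDocument") in data:
        return "word"
    if _utf16("Workbook") in data:
        return "excel"
    return "unknown-ole"
```

Because other file types share the same container (as noted above), the `'unknown-ole'` result is kept separate from `None` so such files are not silently treated as Word or Excel.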

Aurelia
  • I started to go down this rabbit hole last night. This turns out to be the header for the olecf (container file). I was hoping there was some sort of content type that describes the embedded stream (content) that was easy to identify. I gave up on this endeavor, mainly due to lack of time... – Mike Jan 03 '18 at 21:00
  • Yes - early versions of Office use a slightly stripped down memory dump of the open document in a container format. It's not pretty. – Aurelia Jan 04 '18 at 00:10

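You can use Get-Content on a file and check whether it contains the string "[Content_Types].xml". The square brackets have to be escaped, because -match treats its right-hand side as a regular expression:

(Get-Content "C:\Files.doc" -Raw) -match [regex]::Escape("[Content_Types].xml")

I just opened a bunch of Excel documents in Notepad and they all seem to contain the lines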

PK Somerandomgarbage
    [Content_Types].xml morerandomgarbage
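The same string search can be done on raw bytes, which avoids any text-encoding surprises with binary files. This is a small Python sketch for illustration, and the function name `looks_like_zipped_office` is hypothetical. It only reports whether the file looks like a zipped Office package, not whether it is Excel or Word.

```python
def looks_like_zipped_office(path):
    """True if the file starts with 'PK' and names [Content_Types].xml.

    ZIP archives store entry names uncompressed, so a plain byte
    search finds the name even though the entry data is compressed.
    """
    with open(path, "rb") as f:
        data = f.read()
    return data.startswith(b"PK") and b"[Content_Types].xml" in data
```

Legacy binary files will return `False` here, which matches the comment below that such files did not contain the tag.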
Shadowzee
  • I had hoped it would be that simple, but my files didn't contain the [Content_Types] tag, although strangely enough they did have the word Excel in them. I ended up not going this route because I was unsure whether it offered enough consistency to get the job done. – Mike Jan 03 '18 at 21:03
  • You could also load up the Word/Excel COM objects and use those to load the documents. That could take some time, but Word doesn't load Excel documents very well and vice versa. – Shadowzee Jan 03 '18 at 23:09