Proper way to differentiate pst and dbx files in bash shell

Question

I want to identify the format of the input file given to my shell script: whether it is a .pst or a .dbx file. I checked "How to check the extension of a filename in a bash script?". That one deals with txt files, and two methods are given there:

check if the extension is txt
check if the mime type is application/text etc.

I tried file -ib <filename> on a .pst and a .dbx file and it showed application/octet-stream for both. However, if I just do file <filename>, then I get

this for the dbx file -

file1.dbx: Microsoft Outlook Express DBX File Message database

and this for the pst file -

file2.pst: Microsoft Outlook binary email folder (Outlook >=2003)

So, my questions are -

Is it better to use MIME type detection every time, when the output can be anything and we need a proper check?
How do I apply a MIME type check in this case, when both files return "application/octet-stream"?

Update
I didn't want to do extension-based detection because, on a Unix system, we can't be sure that a .dbx file truly is a dbx file. file <filename> returns a line that contains the correct information about the file (e.g. "Microsoft Outlook Express DBX File Message database"), which means the file command is able to identify the file type properly. Then why does file -ib <filename> not report the correct information?
Will parsing the string output of file <filename> be fine? Is it advisable, assuming I only need to identify a narrow set of data storage files of the Outlook family (MS Outlook Express, MS Office Outlook 2003, 2007, 2010, etc.)? A small text identifier like application/dbx that could be compared would be all I need.

it's returning `application/octet-stream` for both since they are both binary files, and it hasn't been instructed to do otherwise. There's nothing stopping you from adding it to the system's `magic.mime` (except for not having actual mime types for the dbx/pst file types) — Hasturkun, Feb 02 '11 at 09:47

Michael Dillon · Answer 1 · 2011-02-02T09:41:00.973

The file command relies on having a file type detection database which includes rules for the file types that you expect to encounter. It may not be possible to recognize these file types if the file content doesn't have a unique code near the beginning of the file.
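For the narrow set of Outlook formats in the question, matching on the description that file prints may be enough. The following is only a sketch. It assumes the descriptions look like the two samples quoted in the question, and since the wording can differ between versions of file, the patterns are kept loose. The function names and the pst/dbx/unknown labels are made up for illustration.

```shell
#!/usr/bin/env bash

# Map a description string, as printed by `file -b`, to a short label.
# The patterns are based on the two sample outputs quoted in the question;
# other versions of file may word them differently (assumption).
classify_description() {
    case "$1" in
        *"Outlook Express DBX"*)              echo "dbx" ;;
        *"Microsoft Outlook"*"email folder"*) echo "pst" ;;
        *)                                    echo "unknown" ;;
    esac
}

# Hypothetical wrapper: run file in brief mode (-b drops the "name:" prefix)
# and classify whatever description it reports.
classify_file() {
    classify_description "$(file -b -- "$1")"
}
```

Calling classify_file on the input path then yields a small identifier that the script can compare, at the cost of depending on the local magic database.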

Note that the -i option to emit MIME types actually uses a separate "magic" numbers file to recognize file types, rather than translating the long descriptions to MIME types. It is quite possible for these two databases to be out of sync. If your application really needs to recognize these two file types, I suggest that you look at the source code for "file" to see how it recognizes them, and then code this recognition algorithm right into your app.
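Coding the recognition directly could look roughly like the sketch below, which compares the first four bytes of the file against known signatures. The signature values are an assumption here and should be checked against the magic database on your own system: PST files are taken to start with the ASCII bytes !BDN (21 42 44 4e), and DBX files with cf ad 12 fe. The function name mail_store_type is hypothetical.

```shell
#!/usr/bin/env bash

# Print "pst", "dbx", or "unknown" based on the first four bytes of a file.
# Assumed signatures (verify them against your own magic database):
#   PST: 21 42 44 4e  ("!BDN")
#   DBX: cf ad 12 fe
mail_store_type() {
    local magic
    # Unreadable or missing files are reported as unknown.
    [ -r "$1" ] || { echo "unknown"; return 2; }
    # od prints the first 4 bytes as hex; tr removes spaces and newlines.
    magic=$(od -An -tx1 -N4 < "$1" | tr -d ' \n')
    case "$magic" in
        2142444e) echo "pst" ;;
        cfad12fe) echo "dbx" ;;
        *)        echo "unknown"; return 1 ;;
    esac
}
```

This avoids both the extension and the installed version of file, but it only checks the leading signature and does not validate the rest of the file.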

If you want to do the equivalent of DOS file type detection, then take the extension from the filename (everything after the last period) and look up that string in your own table where you define the types that you need.
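A minimal sketch of such a table lookup in bash follows. The function name and the labels it prints are hypothetical, and, as the question points out, this trusts the filename alone and says nothing about the actual content.

```shell
#!/usr/bin/env bash

# Look up a filename's extension in a small hand-written table.
# The labels printed here are hypothetical; use whatever your script needs.
type_from_extension() {
    local name ext
    name=${1##*/}                 # drop any leading directory
    case "$name" in
        *.*) ext=${name##*.} ;;   # text after the last period
        *)   ext="" ;;            # no period, so no extension
    esac
    # Lowercase so that FILE.PST and file.pst are treated alike.
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        pst) echo "pst" ;;
        dbx) echo "dbx" ;;
        *)   echo "unknown" ;;
    esac
}
```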
