29

I have a byte array stored in a database.

How can I determine the file type (MIME type, and from that the file extension) of a byte array in Java?

Prakash K
emilan
  • A `byte array` is an array of bytes and nothing more. If you have a `byte array`, you can't say what's stored in it. You can try guessing from the contents of the byte array, but it will be nothing more than guessing. – bezmax Apr 06 '12 at 07:17
  • I don't think so. I can do it with the MagicMatch class, but for that I need to import an external jar. I'm looking for something else. `byte[] data = ...` `MagicMatch match = Magic.getMagicMatch(data);` `String mimeType = match.getMimeType();` – emilan Apr 06 '12 at 07:20
  • What I meant to say - there is no mimetype saved within a byte array anywhere (except for some datatypes which support it). For example if you have a `Hello World.txt` file written to byte array, you would have 11 bytes in it: `H,e,l,l,o, ,w,o,r,l,d`. There is no mimetype as you can see. What `Magic` library does - it tries to **guess** filetype by the contents of the file. Kind of like anti-virus software looks for patterns of viruses, these kinds of libraries try to **guess** the mimetype by some specific patterns common for those mimetypes. – bezmax Apr 06 '12 at 07:24
  • I guess you are right :) Maybe I need to save additional column in my DB for file extension. – emilan Apr 06 '12 at 07:31
  • Your question is meaningless. Byte arrays are not files and do not have file extensions. – user207421 Apr 06 '12 at 09:44
  • 1
    @EJP The question is not meaningless. Clearly he is referring to the contents of the byte array. Please be considerate with your postings and use discretion before publicizing your ignorance. – Mr. Port St Joe Dec 08 '16 at 14:23

3 Answers

55

It turns out that the JDK's URLConnection class has a decent method for this, guessContentTypeFromStream. Please refer to the following answer: Getting A File's Mime Type In Java

If you need to determine the MIME type from a byte array instead of a file, simply use java.io.ByteArrayInputStream (a class that reads bytes from a byte array) instead of java.io.FileInputStream (a class that reads bytes from a file), as in the following example:

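byte[] content = ...; // the bytes read from the database
// needs java.io.ByteArrayInputStream, java.io.InputStream, java.net.URLConnection
try (InputStream is = new ByteArrayInputStream(content)) {
    // returns null if the leading bytes match no signature the JDK knows
    String mimeType = URLConnection.guessContentTypeFromStream(is);
}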

Hope this helps...
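For anyone who wants to try this without a database, here is a minimal self-contained sketch. The class name `StreamGuessDemo` and the helper `guessMimeType` are hypothetical names I made up for illustration; only `URLConnection.guessContentTypeFromStream` is the actual JDK call. Note that the JDK recognizes only a small set of signatures, so a null result is common.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;

final class StreamGuessDemo {
    // Hypothetical helper: returns the guessed MIME type,
    // or null if the JDK does not recognize the leading bytes.
    static String guessMimeType(byte[] content) {
        try (InputStream is = new ByteArrayInputStream(content)) {
            return URLConnection.guessContentTypeFromStream(is);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample data: just the signature bytes of a GIF and a PNG file.
        byte[] gif = {'G', 'I', 'F', '8', '9', 'a'};
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
        System.out.println(guessMimeType(gif));
        System.out.println(guessMimeType(png));
    }
}
```

Callers should treat null as "unknown" and fall back to whatever default suits the application.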

Yuriy Nakonechnyy
  • Only helpful if you can write the byte array content to a file then read that back again, which wasn't part of the original question. (I'm in the same situation.) – jmkgreen Jan 30 '13 at 10:05
  • 8
    No-no, this solution works with any stream of bytes - please refer again to my edited answer. In Java, `InputStream` is an abstraction over `anything from where bytes can be read`, so when somewhere `InputStream` is needed - it's just a matter of finding correct `InputStream` implementation. – Yuriy Nakonechnyy Jan 30 '13 at 12:33
  • @SachinHR please elaborate your case and I'll try to help you – Yuriy Nakonechnyy Aug 17 '20 at 11:21
14

If this is for storing a file that is uploaded:

  • create a column for the filename extension
  • create a column for the mime type as sent by the browser

If you don't have the original file, and you only have bytes, you have a couple of good solutions.

If you're able to use a library, look at using mime-util to inspect the bytes:

http://technopaper.blogspot.com/2009/03/identifying-mime-using-mime-util.html

If you have to build your own byte detector, here are the starting bytes (magic numbers) of many popular formats, written as Ruby hash entries:

"BC" => bitcode,
"BM" => bitmap,
"BZ" => bzip,
"MZ" => exe,
"SIMPLE"=> fits,
"GIF8" => gif,
"GKSM" => gks,
[0x01,0xDA].pack('c*') => iris_rgb,
[0xF1,0x00,0x40,0xBB].pack('c*') => itc,
[0xFF,0xD8].pack('c*') => jpeg,
"IIN1" => niff,
"MThd" => midi,
"%PDF" => pdf,
"VIEW" => pm,
[0x89].pack('c*') + "PNG" => png,
"%!" => postscript,
"Y" + [0xA6].pack('c*') + "j" + [0x95].pack('c*') => sun_rasterfile,
"MM" + [0x00].pack('c*') + "*" => tiff,
"II*" + [0x00].pack('c*') => tiff,
"gimp xcf" => gimp_xcf,
"#FIG" => xfig,
"/* XPM */" => xpm,
[0x23,0x21].pack('c*') => shebang,
[0x1F,0x9D].pack('c*') => compress,
[0x1F,0x8B].pack('c*') => gzip,
"PK" + [0x03,0x04].pack('c*') => pkzip,
"MZ" => dos_os2_windows_executable,
[0x7F].pack('c*') + "ELF" => unix_elf,
[0x99,0x00].pack('c*') => pgp_public_ring,
[0x95,0x01].pack('c*') => pgp_security_ring,
[0x95,0x00].pack('c*') => pgp_security_ring,
[0xA6,0x00].pack('c*') => pgp_encrypted_data,
[0xD0,0xCF,0x11,0xE0].pack('c*') => docfile
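Since the question is about Java, here is a rough sketch of what such a detector could look like there. It covers only a handful of the signatures above. The class name `MagicSniffer`, the method `guess`, and the MIME strings on the right-hand side are my own choices for illustration, not part of any library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MagicSniffer {
    // Signature bytes -> MIME type, checked in insertion order.
    // The map is only iterated, never looked up by key, so byte[] keys are fine.
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();

    static {
        put("image/png", 0x89, 'P', 'N', 'G');
        put("image/jpeg", 0xFF, 0xD8);
        put("image/gif", 'G', 'I', 'F', '8');
        put("application/pdf", '%', 'P', 'D', 'F');
        put("application/zip", 'P', 'K', 0x03, 0x04);
        put("application/gzip", 0x1F, 0x8B);
    }

    private static void put(String mime, int... bytes) {
        byte[] signature = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            signature[i] = (byte) bytes[i];
        }
        SIGNATURES.put(signature, mime);
    }

    // Returns the MIME type of the first matching signature, or null if none matches.
    static String guess(byte[] data) {
        for (Map.Entry<byte[], String> entry : SIGNATURES.entrySet()) {
            if (startsWith(data, entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Short signatures such as the two-byte ones can produce false positives on arbitrary data, so this is still guessing, as the comments on the question point out.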
joelparkerhenderson
2

Maybe I need to save additional column in my DB for file extension.

That is a better solution than attempting to deduce a mimetype based on the database content, for (at least) the following reasons:

  • If you have a mime type from the document source, you can store and use that.
  • You could (potentially) ask the user to specify a mimetype when they lodge the document.
  • If you have to use some heuristic-based scheme for figuring out a mimetype:
    • you can do the work once before creating the table row, rather than N times after extracting it, and
    • you can report cases where the heuristic gives no good answer, and maybe ask the user to say what the file type really is.
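The points above could be sketched roughly as follows: decide the MIME type once, when the document is lodged, and keep it next to the bytes. The class `StoredDocument`, its fields, and the `fromUpload` factory are hypothetical names, and I'm using `URLConnection.guessContentTypeFromStream` as the heuristic only as a stand-in; any detector would fit in its place.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;

final class StoredDocument {
    final byte[] content;
    final String mimeType; // null means the heuristic gave no answer

    private StoredDocument(byte[] content, String mimeType) {
        this.content = content;
        this.mimeType = mimeType;
    }

    // Prefer the MIME type supplied by the document source;
    // run the heuristic only when nothing was supplied.
    static StoredDocument fromUpload(byte[] content, String declaredMimeType) {
        boolean declared = declaredMimeType != null && !declaredMimeType.isEmpty();
        String mime = declared ? declaredMimeType : guess(content);
        return new StoredDocument(content, mime);
    }

    // True when the caller should ask the user what the file type really is.
    boolean needsUserInput() {
        return mimeType == null;
    }

    private static String guess(byte[] content) {
        try (InputStream is = new ByteArrayInputStream(content)) {
            return URLConnection.guessContentTypeFromStream(is);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both fields would then be written to the table row together, so nothing has to be re-detected when the row is read back.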

(I'm making some assumptions that may not be warranted, but the question doesn't give any clues on how the larger system is intended to work.)

Stephen C