I'm handling uploads of different file types on the server side. I have implemented a method that determines the file format by comparing the first bytes of the file with the known byte sequences (signatures) of specific file formats. While searching I found this answer, which helped me a lot, so I implemented my method like this:

private static MediaFormat GetFormat(byte[] bytes, string fileName = null)
{
    // these are my file formats byte sequences
    byte[] jpeg = new byte[] { 255, 216, 255, 224 };
    byte[] jpeg2 = new byte[] { 255, 216, 255, 225 };
    byte[] png = new byte[] { 137, 80, 78, 71 };
    byte[] doc = new byte[] { 208, 207, 17, 224, 161, 177, 26, 225 };
    byte[] docx_zip = new byte[] { 80, 75, 3, 4 };
    byte[] pdf = new byte[] { 37, 80, 68, 70, 45 }; // "%PDF-", so PDF 2.0 files match as well

    if (jpeg.SequenceEqual(bytes.Take(jpeg.Length)))
        return MediaFormat.jpg;
    if (jpeg2.SequenceEqual(bytes.Take(jpeg2.Length)))
        return MediaFormat.jpg;
    if (png.SequenceEqual(bytes.Take(png.Length)))
        return MediaFormat.png;
    if (doc.SequenceEqual(bytes.Take(doc.Length)))
        return MediaFormat.doc;
    if (docx_zip.SequenceEqual(bytes.Take(docx_zip.Length)))
    {
        if (!string.IsNullOrEmpty(fileName) && fileName.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
            return MediaFormat.zip;

        return MediaFormat.docx;
    }
    if (pdf.SequenceEqual(bytes.Take(pdf.Length)))
        return MediaFormat.pdf;

    return MediaFormat.unknown;
}

The answer I found (and linked in this question) points to a site listing byte sequences for other formats, but unfortunately that site now returns a 404, so I couldn't find all the formats I need: PowerPoint (.ppt, .pptx), Excel (.xls, .xlsx), CSV (.csv) and, if possible, plain text (.txt).

Could anyone tell me what the correct byte sequences are, or where I can find them? Thanks so much!

Matt P
  • https://github.com/neilharvey/FileSignatures – mjwills Jun 13 '21 at 14:09
  • .pptx, .xlsx, .docx etc. are actually .zip files, so their header is 0x50, 0x4B, 0x03, 0x04. .ppt, .xls and .doc use Microsoft's own binary format; they are also containers of files and share the header 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1. It seems you can't tell them apart by the header alone, only by analyzing the contents (unzipping or reading the byte streams inside). .csv and .txt can be identified (if they are UTF-8 encoded) by the BOM (byte order mark) – Oleg Skripnyak Jun 13 '21 at 14:37
  • Thanks @OlegSkripnyak, excuse the ignorance, but how can I transform `0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1` into an array of bytes? thanks! – Matt P Jun 13 '21 at 15:11
  • @MattP Just write it as is; it's the hex representation of integer numbers, which any C-like programming language understands. 0xFF == 255, 0x80 == 128 – Oleg Skripnyak Jun 13 '21 at 15:39
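
The points raised in the comments can be put together in a rough sketch. It writes the signatures as hex literals, checks for the OLE2 and UTF-8 BOM headers, and opens ZIP-based files to look for the part names that Office normally writes (word/document.xml, xl/workbook.xml, ppt/presentation.xml). The class name OfficeSniffer, the method Describe and its string results are hypothetical and chosen for illustration only. Relying on those part names is an assumption that holds for typical Office output, not a guarantee for every valid file.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, not part of the original question.
public static class OfficeSniffer
{
    // Hex literals are ordinary integers in C#: 0xD0 == 208, 0xCF == 207, and so on.
    private static readonly byte[] Ole2 = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    private static readonly byte[] Zip = { 0x50, 0x4B, 0x03, 0x04 };
    private static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    private static bool StartsWith(byte[] data, byte[] signature)
    {
        return data.Length >= signature.Length
            && data.Take(signature.Length).SequenceEqual(signature);
    }

    public static string Describe(byte[] bytes)
    {
        // .doc, .xls and .ppt share this header, so it cannot tell them apart.
        if (StartsWith(bytes, Ole2)) return "ole2";

        // Only text files saved with a UTF-8 BOM are caught here (.txt or .csv).
        if (StartsWith(bytes, Utf8Bom)) return "utf8-text";

        if (!StartsWith(bytes, Zip)) return "unknown";

        try
        {
            using (var stream = new MemoryStream(bytes))
            using (var archive = new ZipArchive(stream, ZipArchiveMode.Read))
            {
                // Assumption: the conventional main part names are present.
                if (archive.GetEntry("word/document.xml") != null) return "docx";
                if (archive.GetEntry("xl/workbook.xml") != null) return "xlsx";
                if (archive.GetEntry("ppt/presentation.xml") != null) return "pptx";
                return "zip";
            }
        }
        catch (InvalidDataException)
        {
            // Starts like a ZIP but is truncated or corrupt.
            return "unknown";
        }
    }
}
```

A text file without a BOM has no signature at all, so .csv and .txt generally still need the file extension or a content heuristic. Separating .doc, .xls and .ppt would require reading the streams inside the OLE2 container, which this sketch does not attempt.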

0 Answers