I am working on a compound file that contains several streams, and I can't figure out the content of each stream. I don't know whether these bytes are text, MP3, or video. Is there a way to work out what type of data these bytes could be? For example:

b'\x00\x00\x00\x00\x00\x00\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x0bz\xcc\xc9\xc8\xc0\xc0\x00\xc2?\x82\x1e<\x0ec\xbc*8\x19\xc8i\xb3W_\x0b\x14bH\x00\xb2-\x99\x18\x18\xfe\x03\x01\x88\xcf\xc0\x01\xc4\xe1\x0c\xf9\x0cE\x0c\xd9\x0c\xc5\x0c\xa9\x0c%\x0c\x86`\xcd \x0c\x020\x1a\x00\x00\x00\xff\xff\x02\x080\x00\x96L~\x89W\x00\x00\x00\x00\x80(\\B\xefI;\x9e}p\xfe\x1a\xb2\x9b>(\x81\x86/=\xc9xH0:Pwb\xb7\xdck-\xd2F\x04\xd7co'
Raul
  • 2
    Possible duplicate of [Python 3 - Encode/Decode vs Bytes/Str](https://stackoverflow.com/questions/14472650/python-3-encode-decode-vs-bytes-str) – m0etaz Nov 13 '18 at 19:30
  • 2
    As in, "how do I tell if these bytes comprise an mp3, or a video, or an image, or something else?"? There's no universal way of determining a data format. Some formats have convenient self-identifying header data, and some don't. – Kevin Nov 13 '18 at 19:35
  • 2
    Your question is very unclear. What exactly are you trying to do? – Joel Nov 13 '18 at 19:35
  • 1
    @Kevin that is exactly what I want to do. Is there a technique or pattern used to test these bytes to get close to something? How do I read the header? All I have is bytes – Ibrahim Kais Ibrahim Nov 13 '18 at 19:41
  • 1
    Compare your bytes against Every. Known. Filetype. That's it. It's not `magic`; that is how `file` works. (Descriptions of both of these two terms can be found in your favourite `man` version.) – Jongware Nov 13 '18 at 20:22
  • Don't edit your answer into your question. Instead, post it as an answer once the question has been reopened. – Robert Columbia Nov 26 '18 at 13:33
  • @RobertColumbia the question is closed for me to add my answer that is why I added it to the question – Ibrahim Kais Ibrahim Nov 26 '18 at 13:34
  • The reason it is closed is because other users don't think it's ready for answers. It has two reopen votes now, wait patiently to see if it gets reopened. If it doesn't, you can talk to us in chat and ask for more help getting it reopened. Please don't violate our rules by bypassing the close system. – Robert Columbia Nov 26 '18 at 13:36
  • 1
    Your question currently has three reopen votes. You can come to chat and ask for help getting it reopened. – Robert Columbia Nov 27 '18 at 15:29
  • The question has now been reopened. – Robert Columbia Nov 27 '18 at 15:59

1 Answer

Yes, there is a way to figure out each stream's content. There is a signature for each file, in addition to the extension, which is not reliable: it might be removed or falsely added.

So what is the signature?

In computing, a file signature is data used to identify or verify the contents of a file. In particular, it may refer to:

  • File magic number: bytes within a file used to identify the format of the file; generally a short sequence of bytes (most are 2-4 bytes long) placed at the beginning of the file; see list of file signatures

  • File checksum or more generally the result of a hash function over the file contents: data used to verify the integrity of the file contents, generally against transmission errors or malicious attacks. The signature can be included at the end of the file or in a separate file.
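The second kind of signature verifies content rather than identifying a format. As a minimal sketch (the helper name `sha256_hex` and the sample byte strings are made up for illustration), Python's `hashlib` can compute such a hash:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical sample data, only for illustration.
original = b"example stream contents"
received = b"example stream contents"
corrupted = b"example stream c0ntents"

print(sha256_hex(original) == sha256_hex(received))   # True: identical contents
print(sha256_hex(original) == sha256_hex(corrupted))  # False: one byte differs
```

A hash like this tells you whether two byte sequences are the same, but it says nothing about what kind of data they hold, which is why the rest of this answer relies on magic numbers.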

I mentioned the magic number above. To define the term, I'm copying this from Wikipedia:

In computer programming, the term magic number has multiple meanings. It could refer to one or more of the following:

  • Unique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants
  • A constant numerical or text value used to identify a file format or protocol; for files, see List of file signatures
  • Distinctive unique values that are unlikely to be mistaken for other meanings (e.g., Globally Unique Identifiers)

In the second sense, it is a specific sequence of bytes, such as

PNG (89 50 4E 47 0D 0A 1A 0A) 

or

BMP (42 4D)
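A minimal sketch of checking these two signatures at the start of a byte string could look like this (the function name `guess_type` and the returned labels are illustrative, not from any library):

```python
PNG_MAGIC = bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")
BMP_MAGIC = bytes.fromhex("42 4D")

def guess_type(data: bytes) -> str:
    """Guess the format from the leading bytes; only PNG and BMP are known here."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(BMP_MAGIC):
        return "bmp"
    return "unknown"

print(guess_type(b"\x89PNG\r\n\x1a\n" + b"...rest of file..."))  # png
print(guess_type(b"BM" + b"...rest of file..."))                 # bmp
print(guess_type(b"plain text"))                                 # unknown
```

Short signatures such as the two-byte BMP one can match by coincidence, so a hit is a hint rather than proof.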

So how do you find the magic number of a file?

In the article "Investigating File Signatures Using PowerShell", the writer created a wonderful PowerShell function to get the magic number. He also mentioned a tool, and I'm copying this from his article:

PowerShell V5 brings in Format-Hex, which can provide an alternative approach to reading the file and displaying the hex and ASCII value to determine the magic number.

From the Format-Hex help, I'm copying this description:

The Format-Hex cmdlet displays a file or other input as hexadecimal values. To determine the offset of a character from the output, add the number at the leftmost of the row to the number at the top of the column for that character.

This cmdlet can help you determine the file type of a corrupted file or a file which may not have a file name extension. Run this cmdlet, and then inspect the results for file information.
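If the bytes are already in Python, a similar view can be produced without PowerShell. This is only a rough sketch of a hex dump, not a reimplementation of Format-Hex; the name `hex_dump` and its `width` and `limit` parameters are my own choices:

```python
def hex_dump(data: bytes, width: int = 16, limit: int = 64) -> str:
    """Format the first `limit` bytes as offset, hex values, and printable ASCII."""
    data = data[:limit]
    pad = width * 3 - 1  # width of a full row of hex values
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Non-printable bytes are shown as dots.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{pad}}  {ascii_part}")
    return "\n".join(lines)

print(hex_dump(b"\x89PNG\r\n\x1a\n"))
```

The first row of the output starts with the offset `00000000` followed by `89 50 4E 47 0D 0A 1A 0A`, which is the PNG magic number mentioned above.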

This tool is also very good for getting the magic number of a file.

Another tool is an online hex editor, but to be honest I didn't understand how to use it.

Now we have the magic number, but how do we know what type of data the file or stream holds? That is the key question. Luckily, there are many databases of these magic numbers. Let me list some:

  1. File Signatures
  2. FILE SIGNATURES TABLE
  3. List of file signatures

For example, the first database has a search capability: just enter the magic number with no spaces and search.

After that you may find a match. Yes, may: there is a good chance that you won't directly find the file type in question.
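One reason a lookup can fail is that the signature does not sit at offset 0 of the stream. A possible sketch, assuming a small hand-picked table (`SIGNATURES` and `find_signatures` are illustrative names, and the table is far from complete), is to search the whole stream for each known signature:

```python
# A tiny, hand-picked table; real signature databases are much larger.
SIGNATURES = {
    "png": bytes.fromhex("89 50 4E 47 0D 0A 1A 0A"),
    "gzip": bytes.fromhex("1F 8B"),
    "bmp": bytes.fromhex("42 4D"),
}

def find_signatures(data: bytes):
    """Return (offset, name) for the first occurrence of each known signature."""
    hits = []
    for name, magic in SIGNATURES.items():
        offset = data.find(magic)
        if offset != -1:
            hits.append((offset, name))
    return sorted(hits)

# The first 16 bytes of the stream shown in the question.
sample = b"\x00\x00\x00\x00\x00\x00\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00"
print(find_signatures(sample))  # [(8, 'gzip')]
```

Here the gzip magic number `1F 8B` turns up at offset 8, after eight zero bytes. Two-byte signatures match by chance fairly often in binary data, so every hit should be confirmed by actually trying to decode the data from that offset.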

I faced this and solved it by testing the streams against specific signatures. For example, this is how I searched a stream for a PNG:

# Magic number for PNG: 89 50 4E 47 0D 0A 1A 0A
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

def GetPngStartingOffset(arr):
    # arr may be a bytes object or a list of integer byte values.
    # Returns the offset of the first complete, contiguous PNG signature,
    # or -1 if there is none (0 is a valid offset, so it cannot mean "not found").
    return bytes(arr).find(PNG_SIGNATURE)

Once this function finds the magic number, I split the stream at that offset and open the PNG image:

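    import io
    from PIL import Image   # Pillow

    arr = stream.read()
    offset = GetPngStartingOffset(arr)
    # Make sure a PNG signature really starts at the reported offset.
    if arr[offset:offset + 8] != b'\x89PNG\r\n\x1a\n':
        raise ValueError("no PNG signature found in stream")
    image = Image.open(io.BytesIO(arr[offset:]))   # PNG data starts at the signature
    image.show()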
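To save the image rather than just display it, it helps to know where the PNG ends as well. The sketch below is a simplification, not what I originally used: it assumes the image ends at the first IEND chunk marker after the signature (the bytes `49 45 4E 44 AE 42 60 82`), and `carve_png` is a hypothetical helper name. A stricter approach would walk the PNG chunks one by one.

```python
PNG_SIGNATURE = bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")
PNG_TRAILER = bytes.fromhex("49 45 4E 44 AE 42 60 82")  # "IEND" chunk type plus its CRC

def carve_png(data: bytes):
    """Return the bytes from the PNG signature through the IEND chunk, or None."""
    start = data.find(PNG_SIGNATURE)
    if start == -1:
        return None
    end = data.find(PNG_TRAILER, start)
    if end == -1:
        return None
    return data[start:end + len(PNG_TRAILER)]

# Synthetic data, only to show the slicing; it is not a valid image.
fake = b"junk" + PNG_SIGNATURE + b"chunks" + PNG_TRAILER + b"tail"
print(carve_png(fake) == PNG_SIGNATURE + b"chunks" + PNG_TRAILER)  # True
```

The result can then be written with `open("extracted.png", "wb")`, where the file name is just an example.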

In the end, this is not an end-to-end solution, but it is a way to figure out a stream's content. Thanks for reading, and thanks to @Robert Columbia for his patience.

  • "there is a signature for each file" No, @Kevin was right; Some don't, in particular, text files (except for some scripts) don't. – Tom Blodget Nov 27 '18 at 17:53
  • @TomBlodget yes, you are right. Text files don't have a signature unless they have an encoding like UTF-8. And that is because the ASCII characters are stored as they are. – Ibrahim Kais Ibrahim Nov 27 '18 at 18:15