28

I know that protocol-buffers are a serialized format that requires a message format in the .proto in order to read back properly. But I have a file that I do not know the proper message format for because it isn't published. What I am trying to do is to reverse engineer the data myself so i can reconstruct the messages. To do this I need to read the raw file out where I can pick up the field numbers, types and values.

Is there a program that will do this (preferrably in python but C/C++ is cool too)?

tjac
  • 797
  • 2
  • 9
  • 16
  • 2
    Does the last answer in [this](http://stackoverflow.com/questions/7343867/raw-decoder-for-protobufs-format) help? – em.eh.kay Dec 19 '12 at 16:52
  • There's a problem here in that the wire-data is ambiguous, and can only be *reliably* interpreted with a spec. For example, length-prefix data could be: a raw blob, a string, a packed array, or a sub-message. A varint could be zigzag or straight; could be signed or unsigned. A fixed32 could be a floating point, or could be a signed or unsigned integer. The tool that em.eh.kay links to could be useful: just... be aware of the limitations – Marc Gravell Dec 21 '12 at 12:43
  • @em.eh.kay Thanks. That might very well do what I need to get through the first level of the file. I ended up writing a parser based on the wire protocol specs on the google page about it. But this might be easier than doing that. – tjac Dec 22 '12 at 03:12
  • @Marc Gravell, I'm aware that the protocol is open ended in terms of what a wire date type may actually represent. As I've gotten more into this I've learned that the hard way. Thanks to you both for your help with this! – tjac Dec 22 '12 at 03:12
  • see my answer here: https://stackoverflow.com/questions/7914034/how-to-decode-protobuf-binary-response/48868239#48868239 – AutomatedMike Sep 04 '19 at 11:57

3 Answers3

22

After doing some digging, I wrote my own tool to do this. There were other ways to do this, I'm sure, but this tool looks at the description in the source binary. It reads in the description stream and spits out a pseudo-.proto file. From that .proto file you can compile your own pb file and decode your stream.

import sys
import struct

# Helper functions ------------------------------------------------------------
# this comes largely straight out of the google protocol-buffers code for DecodeVarint(internal\decoder.py)
# with a few tweaks to make it work for me
def readVarInt(buffer, pos):
  mask = (1 << 64) - 1
  result = 0
  shift = 0
  startPos = pos
  while 1:
    b = ord(buffer[pos])
    result |= ((b & 0x7f) << shift)
    pos += 1
    if not (b & 0x80):
      if result > 0x7fffffffffffffff:
        result -= (1 << 64)
        result |= ~mask
      else:
        result &= mask
      return (result, pos, pos-startPos)
    shift += 7
    if shift >= 64:
      raise Error('Too many bytes when decoding varint.')

def readQWORD(d, pos):
    try:
        v = struct.unpack("<Q", d[pos:pos+8])[0]
    except:
        print "Exception in readQWORD"
        print sys.exc_info()
        return (None, pos)
    pos += 8
    return (v, pos);

def readDWORD(d, pos):
    try:
        v = struct.unpack("<L", d[pos:pos+4])[0]
    except:
        print "Exception in readDWORD"
        print sys.exc_info()
        return (None, pos)
    pos += 4
    return (v, pos);

def readBYTE(d, pos):
    try:
        v = struct.unpack("<B", d[pos:pos+1])[0]
    except:
        print "Exception in readBYTE"
        print sys.exc_info()
        return (None, pos)
    pos += 1
    return (v, pos);

# returns (value, new position, data type, field ID, and value's length)
def readField(d, pos):
    # read field and type info
    (v, p) = readBYTE(d, pos);
    datatype = v & 7;
    fieldnum = v >> 3;
        
    if datatype == 0:       # varint
        (v, p, l) = readVarInt(d, p)
        return (v, p, datatype, fieldnum, l)    
    elif datatype == 1: # 64-bit
        (v,p) = readQWORD(d, p)
        return (v, p, datatype, fieldnum, 8)    
    elif datatype == 2: # varlen string/blob
        (v, p, l) = readVarInt(d, p)    # get string length
        return (d[p:p+v], p+v, datatype, fieldnum, v)       
    elif datatype == 5: # 32-bit value
        (v,p) = readDWORD(d, p)
        return (v, p, datatype, fieldnum, 4)
    else:
        print "Unknown type: %d [%x]\n" % (datatype, pos)
        return (None, p, datatype, fieldnum, 1);
        
# PARSERS ---------------------------------------------------------------------

#  Parse DescriptorProto field
def PrintDescriptorProto(data, size, prefix):
    pos = 0
    
    while pos < size:
        (d, p, t, fid, l)  = readField(data, pos);
        pos = p
        
        if fid == 1: print "%smessage %s {" % (prefix,d)
        elif fid == 2: PrintFieldDescriptorProto(d, l, prefix+"\t") # FieldDescriptorProto
        elif fid == 3: PrintDescriptorProto(d, l, prefix+"\t") # DescriptorProto
        elif fid == 4: PrintEnumDescriptorProto(d, l, prefix+"\t") # EnumDescriptorProto
        elif fid == 5: 
            print "%sextension_range:" % (prefix)
            PrintDescriptorProto(d, l, prefix+"\t") # ExtensionRange
        elif fid == 6: print "%sextension: %s" % (prefix,d) # FieldDescriptorProto
        elif fid == 7: print "%soptions: %s" % (prefix,d) # MessageOptions
        else: print "***UNKNOWN fid in PrintDescriptorProto %d" % fid
    
    print "%s}" % prefix
    
# Parse EnumDescriptorProto
def PrintEnumDescriptorProto(data, size, prefix):
    pos = 0
    while pos < size:
        (d, p, t, fid, l)  = readField(data, pos);
        pos = p
        
        if fid == 1: print "%senum %s {" % (prefix,d)
        elif fid == 2: PrintEnumValueDescriptorProto(d, l, prefix+"\t") # EnumValueDescriptorProto
        elif fid == 3: # EnumOptions
            print "%soptions" % prefix
        else: print "***UNKNOWN fid in PrintDescriptorProto %d" % fid
    print "%s};" % prefix
    
    
# Parse EnumValueDescriptorProto
def PrintEnumValueDescriptorProto(data, size, prefix):
    pos = 0
    enum = {"name": None, "number": None}
    while pos < size:
        (d, p, t, fid, l)  = readField(data, pos);
        pos = p
        if fid == 1: enum['name'] = d
        elif fid == 2: enum['number'] = d
        elif fid == 3: # EnumValueOptions
            print "%soptions: %s" % (prefix,d)
        else: print "***UNKNOWN fid in PrintDescriptorProto %d" % fid
        
    print "%s%s = %s;" % (prefix, enum['name'], enum['number'])

# Parse FieldDescriptorProto
def PrintFieldDescriptorProto(data, size, prefix):
    pos = 0
    field = {"name": None, "extendee": None, "number": None, "label": None, "type": None, "type_name": None, "default_value": None, "options": None}
    while pos < size:
        (d, p, t, fid, l)  = readField(data, pos);
        pos = p
        
        if fid == 1: field['name'] = d
        elif fid == 2: field['extendee'] = d
        elif fid == 3: field['number'] = d
        elif fid == 4: 
            if d == 1: field['label'] = "optional"
            elif d == 2: field['label'] = "required"
            elif d == 3: field['label'] = "repeated"
            else: print "{{Label: UNKNOWN (%d)}}" % (prefix,d)
            
        elif fid == 5: 
            types = {1: "double", 
                             2: "float", 
                             3: "int64", 
                             4: "uint64", 
                             5: "int32", 
                             6: "fixed64",
                             7: "fixed32",
                             8: "bool",
                             9: "string",
                             10: "group", 
                             11: "message",
                             12: "bytes",
                             13: "uint32",
                             14: "enum",
                             15: "sfixed32",
                             16: "sfixed64",
                             17: "sint32",
                             18: "sint64" }
            if d not in types:
                print "%sType: UNKNOWN(%d)" % (prefix,d)
            else:
                field['type'] = types[d]
                
                
        elif fid == 6: field["type_name"] = d
        elif fid == 7: field["default_value"] = d
        elif fid == 8: field["options"] = d
        else: print "***UNKNOWN fid in PrintFieldDescriptorProto %d" % fid

    output = prefix
    
    if field['label'] is not None: output += " %s" % field['label']
    output += " %s" % field['type']
    output += " %s" % field['name']
    output += " = %d" % field['number']
    if field['default_value']: output += " [DEFAULT = %s]" % field['default_value']
    output += ";"
    print output


#  Parse ExtensionRange field
def PrintExtensionRange(data, size, prefix):
    pos = 0
    while pos < size:
        (d, p, t, fid, l)  = readField(data, pos);
        pos = p
        print "%stype %d, field %d, length %d" % (prefix, t, fid, l)
        
        if fid == 1: print "%sstart: %d" % (prefix,d)
        elif fid == 2: print "%send: %d" % (prefix,d)
        else: print "***UNKNOWN fid in PrintExtensionRange %d" % fid

    
def PrintFileOptions(data, size, prefix):
    pos = 0
    while pos < size:
        (d, p, t, fid, l)  = readField(data, pos);
        pos = p
    
        if fid == 1: print "%soption java_package = \"%s\";" % (prefix,d)
        elif fid == 8: print "%soption java_outer_classname = \"%s\"" % (prefix,d)
        elif fid == 10: print "%soption java_multiple_files = %d" % (prefix,d)
        elif fid == 20: print "%soption java_generate_equals_and_hash = %d" % (prefix,d)
        elif fid == 9: print "%soption optimize_for = %d" % (prefix,d)
        elif fid == 16: print "%soption cc_generic_services = %d" % (prefix,d)
        elif fid == 17: print "%soption java_generic_services = %d" % (prefix,d)
        elif fid == 18: print "%soption py_generic_services = %d" % (prefix,d)
        elif fid == 999: print "%soption uninterpreted_option = \"%s\"" % (prefix,d)        # UninterpretedOption
        else: print "***UNKNOWN fid in PrintFileOptions %d" % fid

# -----------------------------------------------------------------------------
# Main function. 
def ParseProto(filename, offset, size):
    f = open(filename, "rb").read()
    
    data = f[offset:offset+size]
    
    pos = 0
    while pos < size:
        (d, p, t, fid, l)  = readField(data, pos);
        pos = p
        #print "type %d, field %d, length %d" % (t, fid, l)
        
        if fid == 1: print "// source filename: %s" % d
        elif fid == 2: print "package %s;" % d
        elif fid == 3: print "import \"%s\"" % d
        elif fid == 4: PrintDescriptorProto(d, l, "")
        elif fid == 5: print "EnumDescriptorProto: %s" % d
        elif fid == 6: print "ServiceDescriptorProto: %s" % d
        elif fid == 7: print "FieldDescriptorProto: %s" % d
        elif fid == 8: PrintFileOptions(d, l, "")
        else: print "***UNKNOWN fid in ParseProto %d" % fid
    return {}



# main
if __name__ == "__main__":
    if len(sys.argv) != 4:
        print "Usage: %s binaryfile offset size" % sys.argv[0]
        sys.exit(0)
        
    ParseProto(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))
wim
  • 338,267
  • 99
  • 616
  • 750
tjac
  • 797
  • 2
  • 9
  • 16
  • by the way, i didnt implement all of the message types. Just the ones I needed for my particular project. You can implement your own by looking at the describe.proto file in the protocol-buffers src directory. – tjac Dec 25 '12 at 01:54
  • the same approach is used in: http://www.sysdream.com/reverse-engineering-protobuf-apps – Alexander Oh Aug 23 '13 at 19:24
  • sysdream's source code for their protod.py file can be found on their github page: https://github.com/sysdream/Protod – Apteryx Mar 22 '16 at 22:33
  • When I run this, I get the classic streams of nonsense characters prefaced by `package`, despite the fact that I know my protobuf files are higly structured. An example line is `package NbX90c@X9??v?!???S?@;` . What might be causing this? – David Ferris May 22 '17 at 20:45
  • not exactly sure. the code was written for protobufs v2, are you using v3? not sure if that'd change anything, but its all i got without seeing the stream you are working against. – tjac Jul 23 '17 at 04:10
12

I found it useful to convert the raw message to text using protoc --decode_raw < file. If the file actually contains multiple (length-prefixed) messages, save them into separate files first.

Community
  • 1
  • 1
Michel de Ruiter
  • 7,131
  • 5
  • 49
  • 74
3

Try this online protobuf decoder: http://168.138.55.177/

It's open source: https://github.com/aj3423/aproto

BTW I'm the author.

enter image description here

aj3423
  • 2,003
  • 3
  • 32
  • 70