16

I am looking for a scripting (or higher level programming) language (or e.g. modules for Python or similar languages) for effortlessly analyzing and manipulating binary data in files (e.g. core dumps), much like Perl allows manipulating text files very smoothly.

Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), converting data from one endianness to another, etc. That is, things you would normally use C or assembly for, but I'm looking for a language that allows writing tiny pieces of code for highly specific, one-time purposes very quickly.

Any suggestions?

none
Eric Hansander
  • Interesting. I've never seen anything other than asm that lets you deal with binary data in this way. It's not clear that it would be that useful, though, since binary blobs are more or less useless without interpretation. Why not just stick with C and bitwise operations, or cast binary to Perl strings? – guns Jun 14 '09 at 18:36
  • Casting binary to strings is what I use today (in Python or Perl) but I have a feeling that there must be some smoother and more powerful way of doing this out there. I edited the question to try to explain why C is not the answer, in this case. – Eric Hansander Jun 14 '09 at 18:48
  • If this is specifically about core dumps and C programming, you may also want to check out the GNU BFD: http://en.wikipedia.org/wiki/Binary_File_Descriptor_library – none Jun 14 '09 at 20:09
  • Yes, I know about BFD, but I'm looking for something much more lightweight. And I will be using it for other things besides core dumps. – Eric Hansander Jun 14 '09 at 20:14
  • What I really need is a droid that understands the binary language of moisture vaporators. – Nosredna Jul 06 '09 at 21:56
  • lol! I saw the question title on front page and thought it's a fun question, that you were looking for a language in which random permutations of bits are valid programs (as perl is to text)! – Mehrdad Afshari Jul 06 '09 at 21:58
  • @Mehrdad... Does Core War fit your expectation? – Nosredna Jul 06 '09 at 23:53
  • @Nosredna: I think most assembly languages are (as they try to squeeze all the different types of instruction into the minimum number of bytes, many random bit sequences represent a valid instruction, and would crash at runtime). – Mehrdad Afshari Jul 06 '09 at 23:58
  • Yeah. I've seen 6502 programs that let you jump into one place to do one thing and another place to do another thing. The instructions were either one byte or two bytes long, so it could get pretty weird, especially with self-modifying code. :-) – Nosredna Jul 07 '09 at 00:10

11 Answers

30

Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), converting data from one endianness to another, etc. That is, things you would normally use C or assembly for, but I'm looking for a language that allows writing tiny pieces of code for highly specific, one-time purposes very quickly.

Well, while it may seem counter-intuitive, I found Erlang extremely well suited for this, mainly due to its powerful support for pattern matching, even on bytes and bits (the "Erlang Bit Syntax"). This makes it very easy to write even quite advanced programs that inspect and manipulate data at the byte level and even the bit level:

Since 2001, the functional language Erlang comes with a byte-oriented datatype (called binary) and with constructs to do pattern matching on a binary.

And to quote informIT.com:

(Erlang) Pattern matching really starts to get fun when combined with the binary type. Consider an application that receives packets from a network and then processes them. The four bytes in a packet might be a network byte-order packet type identifier. In Erlang, you would just need a single processPacket function that could convert this into a data structure for internal processing. It would look something like this:

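% The /binary specifier lets RestOfPacket bind to all remaining bytes;
% without it the segment would default to a single 8-bit integer.
processPacket(<<1:32/big, RestOfPacket/binary>>) ->
    % Process type one packets
    ...
;
processPacket(<<2:32/big, RestOfPacket/binary>>) ->
    % Process type two packets
    ...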
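For readers more at home in Python, the same dispatch can be approximated with the standard struct module. This is only a rough sketch under stated assumptions: the function name process_packet, the returned tuples, and the sample payload are hypothetical, and Python has no direct counterpart to Erlang's pattern matching on binaries, so the type field is unpacked and compared explicitly.

```python
import struct


def process_packet(packet: bytes):
    """Dispatch on a 32-bit big-endian type field (hypothetical helper)."""
    if len(packet) < 4:
        raise ValueError("packet shorter than its 4-byte type field")
    # ">I" means: big-endian (network order), unsigned 32-bit integer.
    (packet_type,) = struct.unpack_from(">I", packet, 0)
    rest_of_packet = packet[4:]
    if packet_type == 1:
        return ("type one", rest_of_packet)
    if packet_type == 2:
        return ("type two", rest_of_packet)
    raise ValueError(f"unknown packet type {packet_type}")


# Build a sample type-two packet and dispatch it.
sample = struct.pack(">I", 2) + b"payload"
result = process_packet(sample)
```

The Erlang version states the layout once in the function head; the Python sketch needs a separate unpack step and explicit comparisons, which is the difference the answer is pointing at.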

So Erlang, with its built-in support for pattern matching and its functional style, is pretty expressive. See, for example, this implementation of uuencode in Erlang:

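% Regroup the input into 6-bit values and emit each one as a byte, offset by 32.
uuencode(BitStr) ->
    << <<(X+32):8>> || <<X:6>> <= BitStr >>.
% Reverse the mapping: each byte becomes a 6-bit value again.
uudecode(Text) ->
    << <<(X-32):6>> || <<X:8>> <= Text >>.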

For an introduction, see Bitlevel Binaries and Generalized Comprehensions in Erlang. You may also want to check out some of the following pointers:

Yaakov Ellis
none
  • Exactly what I was going to suggest. Erlang can do very cool things with binary data. – JesperE Jun 14 '09 at 19:13
  • I'd go with erlang too on this one – Faisal Vali Jun 14 '09 at 19:19
  • awesome answer! extremely thorough. – ine Jun 14 '09 at 20:43
  • I concur. Joe Armstrong always says that he just cannot believe that no other language has Bit literals and pattern matching on Bits. And I completely agree. I mean, they put Float literals (which really only a very few scientific programmers need) in every sh*tty language, but no Bit syntax? Why? If you have String literals and Regexps, adding Bitstrings and Bit Patterns wouldn't be such a big deal, would it? [Rant over.] – Jörg W Mittag Jun 14 '09 at 23:08
6

Perl's pack and unpack?

toolkit
  • And regexps that work happily on binary -- a feature I helped debug back in the day through my (ab)use of it. And you can pack/unpack to binary if you want to match on that. Recent implementations with their unicode support seemed to have muddied the water but I think all that can be turned off. – George Phillips Jul 06 '09 at 22:11
4

The Python bitstring module was written for this purpose. It lets you take arbitrary slices of binary data and offers a number of different interpretations through Python properties. It also gives plenty of tools for constructing and modifying binary data.

For example:

>>> from bitstring import BitArray, ConstBitStream
>>> s = BitArray('0x00cf')                           # 16 bits long
>>> print(s.hex, s.bin, s.int)                       # Some different views
00cf 0000000011001111 207
>>> s[2:5] = '0b001100001'                           # slice assignment
>>> s.replace('0b110', '0x345')                      # find and replace
2                                                    # 2 replacements made
>>> s.prepend([1])                                   # Add 1 bit to the start
>>> s.byteswap()                                     # Byte reversal
>>> ordinary_string = s.bytes                        # Back to Python string

There are also functions for bit-wise reading and navigation in the bitstring, much like in files; in fact this can be done straight from a file without reading it into memory:

>>> s = ConstBitStream(filename='somefile.ext')
>>> hex_code, a, b = s.readlist('hex:32, uint:7, uint:13')
>>> s.find('0x0001')         # Seek to next occurrence, if found
True
    

There are also views with different endiannesses as well as the ability to swap endianness and much more - take a look at the manual.

Glorfindel
Scott Griffiths
4

Take a look at python bitstring, it looks like exactly what you want :)

Glorfindel
buster
3

I use 010 Editor all the time to view binary files. It's designed specifically for working with them.

It has an easy-to-use, C-like scripting language to parse binary files and present them in a very readable way (as a tree, with fields coded by color, and so on). There are some example scripts to parse ZIP and BMP files.

Whenever I create a binary file format, I always write a little script for 010 Editor to view the files. If you've got some header files with some structs, making a reader for binary files is a matter of minutes.

Wouter van Nifterick
2

Any high-level programming language with pack/unpack functions will do. Perl, Python, and Ruby can all do it; it's a matter of personal preference. I wrote a bit of binary parsing in each of these and felt that Ruby was the easiest and most elegant for this task.
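As an illustration of the pack/unpack style, here is a minimal sketch using Python's standard struct module (assuming Python 3). The sample bytes and variable names are made up for the example; the format codes (">I", "<I", ">h") are standard struct codes for byte order and field width.

```python
import struct

# Six made-up bytes standing in for a chunk read from a binary file.
raw = bytes([0x12, 0x34, 0x56, 0x78, 0xFF, 0xFE])

# Read the first four bytes as one unsigned 32-bit integer in each byte order.
(as_big,) = struct.unpack_from(">I", raw, 0)      # big-endian
(as_little,) = struct.unpack_from("<I", raw, 0)   # little-endian

# Read the last two bytes as a signed 16-bit big-endian integer.
(signed16,) = struct.unpack_from(">h", raw, 4)

# Convert endianness: re-pack the big-endian value as little-endian bytes.
swapped = struct.pack("<I", as_big)

print(hex(as_big), hex(as_little), signed16, swapped.hex())
# prints: 0x12345678 0x78563412 -2 78563412
```

Perl's unpack and Ruby's String#unpack follow the same pattern of a format string applied to a byte string, though the template letters differ between the languages.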

Igor Krivokon
2

Why not use a C interpreter? I always used them to experiment with snippets, but you could use one to script something like you describe without too much trouble.

I have always liked EiC. It was dead for a while, but the project has recently been resurrected. EiC is surprisingly capable and reasonably quick. There is also CINT. Both can be compiled for different platforms, though I think CINT needs Cygwin on Windows.

R Ubben
  • This was my thought as well. I don't know EiC, but cint accepts a loose dialect (it does implicit, but still strong, typing when you don't specify types for newly introduced variables), which makes for a more RAD style when writing macros. – dmckee --- ex-moderator kitten Jun 14 '09 at 19:38
2

Python's standard library has some of what you require -- the array module in particular lets you easily read parts of binary files, swap endianness, etc; the struct module allows for finer-grained interpretation of binary strings. However, neither is quite as rich as you require: for example, to present the same data as bytes or halfwords, you need to copy it between two arrays (the numpy third-party add-on is much more powerful for interpreting the same area of memory in several different ways), and, for example, to display some bytes in hex there's nothing much "bundled" beyond a simple loop or list comprehension such as [hex(b) for b in thebytes[start:stop]]. I suspect there are reusable third-party modules to facilitate such tasks yet further, but I can't point you to one...
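A minimal sketch of the standard-library approach described above, assuming Python 3 and a platform where the array type code "H" is two bytes wide (the usual case). The sample bytes and variable names are invented for the example.

```python
import array

# Four made-up bytes standing in for a slice of a binary file.
data = bytes([0x00, 0xCF, 0x12, 0x34])

# View the bytes as unsigned 16-bit halfwords (native byte order),
# then swap the two bytes of every item in place.
halfwords = array.array("H", data)
halfwords.byteswap()
swapped = halfwords.tobytes()

# Present the same bytes in hex, binary, and decimal with simple comprehensions.
as_hex = [hex(b) for b in data]
as_bin = [format(b, "08b") for b in data]
as_dec = list(data)

# Interpret the first two bytes as one big-endian unsigned integer.
first_word = int.from_bytes(data[0:2], "big")

print(swapped.hex(), as_hex, as_bin, as_dec, first_word)
```

Note that the array object holds its own copy of the bytes, which is the copying limitation the answer mentions.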

Glorfindel
Alex Martelli
1

Forth can also be pretty good at this, but it's a bit arcane.

Jason Baker
1

Well, if speed is not a consideration and you want Perl, then translate each "line" of binary into a line of characters, 0s and 1s. Yes, I know there are no linefeeds in binary :) but presumably you have some fixed unit size (a byte, say) by which you can break up the binary blob.

Then just use Perl's string processing on that data :)
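The same idea can be sketched in Python rather than Perl (an assumption made here purely for illustration; the sample bytes and the pattern being searched for are made up). Each byte becomes eight '0'/'1' characters, after which ordinary regular expressions work on bit patterns.

```python
import re

# Three made-up bytes standing in for a binary blob.
data = bytes([0x00, 0xCF, 0x81])

# Translate every byte into eight characters of '0' and '1'.
bits = "".join(format(byte, "08b") for byte in data)

# Ordinary text processing now applies, e.g. find the first run of
# four or more consecutive 1-bits, even across a byte boundary.
match = re.search(r"1{4,}", bits)

# Convert the text back into the original bytes.
restored = int(bits, 2).to_bytes(len(bits) // 8, "big")

print(bits, match.span(), restored == data)
# prints: 000000001100111110000001 (12, 17) True
```

The text form uses eight times the memory of the original data, which is consistent with the caveat that speed is not a consideration.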

Larry Watanabe
0

If you're doing binary level processing, it is very low level and likely needs to be very efficient and have minimal dependencies/install requirements.

So I would go with C - handles bytes well - and you can probably google for some library packages that handle bytes.

Going with something like Erlang introduces inefficiencies, dependencies, and other baggage you probably don't want with a low-level library.

Larry Watanabe
  • Actually, speed is not that much of an issue for me, as I will mostly use it to "browse" some binary blob (e.g. a core dump, or data recorded from a stream) more or less interactively, with a size that can usually be counted in megabytes. You don't happen to know of any such library packages for C that are worth checking out? – Eric Hansander Jun 14 '09 at 19:31