39

I'm reading in a binary file (a JPEG in this case) and need to find some values in it. Specifically, I'm trying to pick out the image dimensions by looking for the binary structure as detailed here.

I need to find FFC0 in the binary data, skip ahead some number of bytes, and then read 4 bytes (this should give me the image dimensions).

What's a good way of searching for the value in the binary data? Is there an equivalent of 'find', or something like re?

jww
Parand
  • 1
    have you ever looked into imagick? IIRC there is also a python library for it. – txwikinger Jul 10 '10 at 00:44
  • 1
    I have, and it works great, but it's quite heavy for just finding the dimensions of the file. – Parand Jul 10 '10 at 00:50
  • 2
    you should use a module appropriate for something like this http://snippets.dzone.com/posts/show/1021 –  Jul 10 '10 at 02:31

8 Answers

30

You could load the file into a string and search it for the byte sequence 0xffc0 using the find() method (str.find() in Python 2, bytes.find() in Python 3). It works for any byte sequence and returns the offset of the first match, or -1 if there is none.

The code to do this depends on a couple things. If you open the file in binary mode and you're using Python 3 (both of which are probably best practice for this scenario), you'll need to search for a byte string (as opposed to a character string), which means you have to prefix the string with b.

with open(filename, 'rb') as f:
    s = f.read()
s.find(b'\xff\xc0')

If you open the file in text mode in Python 3, you'd have to search for a character string:

with open(filename, 'r') as f:
    s = f.read()
s.find('\xff\xc0')

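though there's no particular reason to do this. It gives you no advantage over the previous way, and it can actively cause problems: in text mode Python 3 decodes the bytes using the platform's default encoding, which can raise UnicodeDecodeError or alter the data for arbitrary binary content, and platforms that treat binary files and text files differently (e.g. Windows) also translate line endings.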

Python 2 doesn't make the distinction between byte strings and character strings, so if you're using that version, it doesn't matter whether you include or exclude the b in b'\xff\xc0'. And if your platform treats binary files and text files identically (e.g. Mac or Linux), it doesn't matter whether you use 'r' or 'rb' as the file mode either. But I'd still recommend using something like the first code sample above just for forward compatibility - in case you ever do switch to Python 3, it's one less thing to fix.
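
Once find() has located the marker, the dimensions can be pulled out with the struct module. The following is a minimal sketch, not a robust parser: the helper name jpeg_size_from_bytes is made up for illustration, and it assumes the baseline layout in which the FFC0 marker is followed by a 2-byte segment length, a 1-byte sample precision, then a 2-byte height and a 2-byte width, all big-endian. A plain find() can also hit FFC0 bytes that happen to sit inside another segment (an embedded thumbnail, for instance), so treat the result with care.

```python
import struct


def jpeg_size_from_bytes(data):
    """Return (width, height) from the first FFC0 segment, or None.

    Hypothetical helper: assumes the marker is followed by a 2-byte length,
    a 1-byte precision, a 2-byte height and a 2-byte width (big-endian).
    """
    pos = data.find(b'\xff\xc0')
    # Need the 2 marker bytes plus 7 more bytes to reach the end of the width.
    if pos == -1 or pos + 9 > len(data):
        return None
    # Skip marker (2) + length (2) + precision (1), then read two uint16 values.
    height, width = struct.unpack('>HH', data[pos + 5:pos + 9])
    return width, height


# Synthetic bytes for illustration only; this is not a complete JPEG file.
sample = b'\xff\xd8' + b'\xff\xc0' + b'\x00\x11' + b'\x08' + b'\x01\xe0' + b'\x02\x80'
print(jpeg_size_from_bytes(sample))
```

With a real file you would pass in the bytes returned by f.read(), or just the first few kilobytes if you only care about the first frame header.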

David Z
  • 18
    If it's a really big file, it's not such a good idea to read it into a string all at once. – icktoofay Jul 10 '10 at 00:51
  • 3
    I doubt it's so big it's going to be a problem. – Chris B. Jul 10 '10 at 00:52
  • 3
    Since I'm only looking for the first frame I'll likely be able to read some small part of the file and process that instead of reading the whole file. – Parand Jul 10 '10 at 00:55
  • @icktoofay: good point, but I would point out that you can do exactly what Parand is saying, just read the first N bytes and search those. If you did have to search all of a large file for a byte sequence, it could be done iteratively so you wouldn't have to keep the whole thing in memory at once, but the code would be a little more involved, and I didn't think it'd be necessary to get into that here. – David Z Jul 10 '10 at 01:26
  • Exactly. I was just saying that it would be better to read/scan it in small chunks. – icktoofay Jul 10 '10 at 01:28
  • Python generators are perfect to process input streams. They make the code as simple as if it was reading everything at once without actually doing it. – MarcH Aug 22 '13 at 22:40
  • @ChrisB. : I'm trying to search a 70 megabyte firmware binary from upload data on my Linux microcontroller. OOM-killer kills the process. So yes, it is indeed a problem. – Janne Paalijarvi May 18 '22 at 20:35
  • 1
    @JannePaalijarvi Yes, if you have a different problem than the OP, then the solution which works for the OP may not work for you. My comment is relevant to the problem as described, not yours. – Chris B. May 19 '22 at 00:12
  • @ChrisB. Yes, you are right. I apologize for my outburst. – Janne Paalijarvi May 20 '22 at 16:55
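
For reference, the chunked search discussed in these comments could look roughly like the sketch below. It is only one possible approach: the function name find_in_stream and the 64 KiB default chunk size are assumptions made for illustration. The key detail is keeping the last len(pattern) - 1 bytes of each chunk, so a match that straddles two chunks is not missed.

```python
import io


def find_in_stream(stream, pattern, chunk_size=64 * 1024):
    """Return the absolute offset of the first match of pattern, or -1.

    Hypothetical helper: reads the binary stream in chunks so that only
    about chunk_size bytes are held in memory at any time.
    """
    keep = len(pattern) - 1   # bytes to carry over between chunks
    buf = b''
    base = 0                  # absolute offset of buf[0] in the stream
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1         # end of stream, nothing found
        buf += chunk
        idx = buf.find(pattern)
        if idx != -1:
            return base + idx
        # Drop everything except the tail that could start a straddling match.
        drop = max(len(buf) - keep, 0)
        base += drop
        buf = buf[drop:]


# The marker straddles the boundary between the first and second chunk here.
demo = io.BytesIO(b'a' * 10 + b'\xff\xc0' + b'b' * 10)
print(find_in_stream(demo, b'\xff\xc0', chunk_size=11))
```

Any object with a read() method that returns bytes should work, including a file opened with 'rb'.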
12

Instead of reading the entire file into memory, searching it, and then writing a new file out to disk, you can use the mmap module. mmap maps the file into the process's address space, so the operating system pages it in on demand rather than loading it all at once, and it allows for in-place modification.

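#!/usr/bin/env python3

import mmap

# "r+b" opens for reading and writing in binary mode ("rw+b" is not a valid mode).
with open("hugefile", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
    print(mm.find(b'\x00\x09\x03\x03'))  # offset of the first match, or -1
    mm.close()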
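
If the file only needs to be searched and not modified, a read-only mapping avoids requiring write permission on it. This is a minimal sketch assuming Python 3.2 or later (where mmap objects can be used in a with statement); the helper name find_in_file and the temporary demo file are made up for illustration. Note that mapping an empty file raises ValueError.

```python
import mmap
import os
import tempfile


def find_in_file(path, pattern):
    """Return the offset of the first match of pattern in the file, or -1.

    Hypothetical helper: maps the file read-only instead of reading it.
    """
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(pattern)


# Build a small throwaway file so the sketch can be run as-is.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\xaa' * 100 + b'\x00\x09\x03\x03' + b'\xaa' * 10)
demo_path = tmp.name

demo_offset = find_in_file(demo_path, b'\x00\x09\x03\x03')
os.remove(demo_path)
print(demo_offset)
```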
synthesizerpatel
10

The bitstring module was designed for pretty much this purpose. For your case the following code (which I haven't tested) should help illustrate:

from bitstring import ConstBitStream
# Can initialise from files, bytes, etc.
s = ConstBitStream(filename='your_file')
# Search to Start of Frame 0 code on byte boundary
found = s.find('0xffc0', bytealigned=True)
if found:
    # find() returns the match position in bits; divide by 8 for the byte offset.
    print("Found start code at byte offset %d." % (found[0] // 8))
    s0f0, length, bitdepth, height, width = s.readlist(
        'hex:16, uint:16, uint:8, 2*uint:16')
    print("Width %d, Height %d" % (width, height))
Scott Griffiths
  • So `Bits.find` returns just a boolean and sets the `Bits.bytepos` attribute? Perhaps in the module documentation you should warn that `bitstring` is not thread-safe (not that it matters in this answer, of course). – tzot Jul 11 '10 at 09:08
  • @ΤΖΩΤΖΙΟΥ: Yes you have a good point. I don't find it surprising that mutating methods or reading methods aren't thread safe, but using 'find' on a bit-wise immutable object could reasonably be expected to be. To be honest it's never cropped up before but it is something to think about... – Scott Griffiths Jul 12 '10 at 07:08
  • Just an idea: `find` could return an object with all necessary information, à la `re.match` and `re.search`. You could have this “BitMatch” class be a subclass of `bool`, for backwards compatibility. – tzot Jul 12 '10 at 07:33
  • @ΤΖΩΤΖΙΟΥ: Thanks, that's a reasonable idea although I'm in a good position to break backward compatibility slightly and maybe just have it return the bit position as a single item tuple if found or an empty tuple if not found. I guess anything's better than returning -1 if not found :) – Scott Griffiths Jul 12 '10 at 16:35
5

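The re module works with both text and binary data (str in Python 2, bytes in Python 3), so you can use it as an alternative to find() for your task. In Python 3 the pattern and the data being searched must be of the same type, so a bytes pattern is needed to search bytes.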
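
As a rough illustration of searching bytes with re, the sketch below compiles a bytes pattern and captures the two 16-bit fields that follow the marker. The pattern and the sample bytes are assumptions for demonstration, based on the FFC0 layout described in the question (marker, 2-byte length, 1-byte precision, height, width). re.DOTALL matters here because without it "." would refuse to match a newline byte (0x0a) that happens to occur in the data.

```python
import re

# Bytes pattern: marker, three bytes to skip, then two 2-byte groups.
SOF0_PATTERN = re.compile(rb'\xff\xc0.{3}(.{2})(.{2})', re.DOTALL)

# Synthetic bytes for illustration only; this is not a complete JPEG file.
data = b'\xff\xd8\xff\xe0\x00\x02\xff\xc0\x00\x11\x08\x01\xe0\x02\x80'

m = SOF0_PATTERN.search(data)
if m:
    height = int.from_bytes(m.group(1), 'big')
    width = int.from_bytes(m.group(2), 'big')
    print(m.start(), width, height)
```

Unlike find(), a match object also gives you the captured fields directly, at the cost of a slightly more cryptic pattern.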

Andrey Vlasovskikh
5

In Python 3.x you can search a byte string by another byte string like this:

>>> byte_array = b'this is a byte array\r\n\r\nXYZ\x80\x04\x95 \x00\x00\x00\x00\x00'
>>> byte_array.find('\r\n\r\n'.encode())
20
>>>
caleb
3

Use the find() method only if you need to know the position of the subsequence. If you only need to know whether it is present, you can use the in operator, for example:

with open("foo.bin", 'rb') as f:
    if b'\x00' in f.read():
        print('The file is binary!')
    else:
        print('The file is not binary!')
kenorb
  • 2
    This did it for me - I was trying to compare a string to a byte string. All I had to do was put the b in front of my search term and it was found within the byte string. – pa1983 Aug 18 '16 at 10:01
2

Well, obviously there is PIL: the Image module exposes size as an attribute. If you want to get the size exactly as you describe, without loading the whole image, you will have to walk through the file piece by piece yourself. Not the nicest way to do it, but it would work.
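
Walking through the file by hand might look something like the sketch below, which steps from segment to segment instead of scanning for raw bytes. It is a simplified illustration rather than a complete parser: the function name jpeg_dimensions is made up, and it assumes every segment before the frame header carries a 2-byte big-endian length, with no fill bytes or standalone markers in between. The frame header markers it accepts are C0 through CF, excluding C4, C8 and CC.

```python
import io
import struct

# Start-of-frame markers: 0xC0-0xCF except 0xC4 (DHT), 0xC8 (JPG), 0xCC (DAC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_dimensions(stream):
    """Return (width, height) from the first frame header, or None.

    Hypothetical helper: expects a seekable binary stream positioned at
    the start of the JPEG data.
    """
    if stream.read(2) != b'\xff\xd8':
        return None  # missing start-of-image marker
    while True:
        header = stream.read(4)  # 0xFF, marker byte, 2-byte segment length
        if len(header) < 4 or header[0] != 0xFF:
            return None
        marker = header[1]
        (length,) = struct.unpack('>H', header[2:4])
        if length < 2:
            return None  # the length field counts its own two bytes
        if marker in SOF_MARKERS:
            payload = stream.read(5)  # precision, height, width
            if len(payload) < 5:
                return None
            _precision, height, width = struct.unpack('>BHH', payload)
            return width, height
        # Not a frame header: skip the rest of this segment.
        stream.seek(length - 2, io.SEEK_CUR)


# Synthetic data: the first segment's payload contains FFC0 bytes that a
# plain byte search would mistake for the frame header.
sample = (b'\xff\xd8'
          b'\xff\xe0\x00\x04\xff\xc0'
          b'\xff\xc2\x00\x11\x08\x00\x64\x00\xc8')
print(jpeg_dimensions(io.BytesIO(sample)))
```

Because it only reads segment headers and seeks past the payloads, this touches very little of the file before it reaches the frame header.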

fridder
1

For Python >=3.2:

import re

# Read the whole file as bytes; the with block closes it automatically.
with open("filename.jpg", "rb") as f:
    data = f.read()

# re.DOTALL lets "." match any byte, including newline bytes (0x0a).
match_obj = re.match(b'\xff\xd8.*\xff\xc0...(..)(..).*\xff\xd9', data, re.DOTALL)
if match_obj:
    # https://stackoverflow.com/q/444591
    print(int.from_bytes(match_obj.group(1), 'big'))  # height
    print(int.from_bytes(match_obj.group(2), 'big'))  # width
TAbdiukov
kissson