How to find a byte sequence in a file?

Question

I have a binary file in which I need to change a certain bit.

That bit's byte's address is relative to some byte sequence (some ASCII string):

import os
from array import array

content = array('B')
with open(filename, mode="r+b") as file:
    content.fromfile(file, os.fstat(file.fileno()).st_size)
    abc = [ord(letter) for letter in "ABC"]
    i = content.index(abc)  # ValueError: array.index(x): x not in list
    content[i + 0x16] |= 1
    file.seek(0)  # rewind so the data is overwritten, not appended
    content.tofile(file)

However, I must confess to my shame that, after Googling far and wide, I couldn't find a method to get the index of that "ABC" string...

Sure, I can write a function that does it with loops, but I can't believe there is no one-liner (OK, even two...) that accomplishes it.

How can it be done?

Your "as efficiently as possible" demand is poorly constrained. Why are you even using Python and not C or assembly if speed is the only concern? — timgeb, Mar 21 '18 at 10:27
@timgeb, I removed that constraint. It's not the main issue here. However if you insist on an answer, then it's a build script, and it must remain a script, not compiled code, and there are many other files to change, while making sure the build doesn't become too slow. Basically I just wanted to avoid using immutable sequences, I want to change the data in-place. — Tar, Mar 21 '18 at 10:36
No worries. The problem with "as efficiently as possible" is that you might get unpythonic answers that sacrifice a lot of readability for nanoseconds. Instead, try to describe the required level of efficiency in a more constrained manner. — timgeb, Mar 21 '18 at 10:38
What is your intended purpose of `"ABC".encode(hex)`? It is a Python 2 method and has been called [not nice](https://stackoverflow.com/a/13437894/2564301) in 2012 ... Anyway: since it converts `ABC` to `414243`, are you **sure** the text `414243` should appear somewhere inside your binary? Or am I misunderstanding its purpose here? — Jongware, Mar 21 '18 at 11:04
@usr2564301, yes, that's the issue, `414243` doesn't appear as string, but as byte-sequence, meaning in some place in the file there is a sequence of `[0x41, 0x42, 0x43]`, I don't know how to: 1: generate that sequence from the string, and 2: how to locate that byte-sequence inside the file's content. I can overcome issue #1 with `abc = [ord(letter) for letter in "ABC"]`, but then #2 still fails. — Tar, Mar 21 '18 at 11:23

score 0 · Answer 1 · 2018-03-21T14:14:56.183

Not sure if this is the most Pythonic way, but this works. Given this file:

$ cat so.bin    
���ABC̻�X��w
$ hexdump so.bin
0000000 eeff 41dd 4342 bbcc 58aa 8899 0a77     
000000e

Edit: New solution starts here.

import string

char_ints = [ord(c) for c in string.ascii_letters]

with open("so.out.bin", "wb") as fo:
    with open("so.bin", "rb") as fi:

        # Read bytes but only keep letters.
        chars = []
        for b in fi.read():
            if b in char_ints:
                chars.append(chr(b))
            else:
                chars.append(" ")

        # Search for 'ABC' in the read letters.
        pos = "".join(chars).index("ABC")

        # We now know the position of the interesting byte.
        pos_x = pos + len("ABC") + 3 # known offset

        # Now copy all bytes from the input to the output, ...
        fi.seek(0)
        i = 0
        for b in fi.read():
            # ... but replace the interesting byte.
            if i == pos_x:
                fo.write(b"Y")
            else:
                fo.write(bytes([b]))
            i = i + 1
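
As a hedged aside, the same search-and-replace can be sketched more compactly with `bytes.find`, which searches for a byte sequence directly, so no conversion to letters is needed. The helper name `replace_byte_after` and its parameters are made up for illustration, and the sample bytes are the ones from the hexdump above:

```python
def replace_byte_after(data: bytes, marker: bytes, offset: int, new_byte: int) -> bytes:
    """Return a copy of data with one byte replaced.

    The replaced byte sits `offset` positions past the END of the first
    occurrence of `marker` (hypothetical helper, for illustration only).
    """
    pos = data.find(marker)  # index of the first match, or -1
    if pos == -1:
        raise ValueError("marker not found")
    target = pos + len(marker) + offset
    if target >= len(data):
        raise IndexError("target byte lies beyond the end of the data")
    patched = bytearray(data)  # mutable copy
    patched[target] = new_byte
    return bytes(patched)


# Same 14 bytes as so.bin in the hexdump above.
sample = bytes.fromhex("ffeedd414243ccbbaa589988770a")
patched = replace_byte_after(sample, b"ABC", 3, ord("Y"))
print(patched.hex())  # ffeedd414243ccbbaa599988770a
```

For the real file, `data` would come from reading `so.bin` in `"rb"` mode, and the result would be written to `so.out.bin` in `"wb"` mode.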

Edit: New solution ends here.

I want to get the X four positions after the C of ABC. A little state-keeping locates the position of ABC, skips the offset, and prints the interesting byte.

foundA = False
foundB = False
foundC = False
found = False
offsetAfterC = 3
lengthAfterC = 1

with open("so.bin", "rb") as f:
    pos = 0
    for b in f.read():
        pos = pos + 1
        if not found:
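            if foundA and foundB and b == 0x43:
                foundC = True
            elif foundA and not foundB and b == 0x42:
                foundB = True
            elif b == 0x41:
                # An 'A' (re)starts a candidate match; with the stricter
                # checks above, inputs like "ABAC" or "ABBC" no longer match.
                foundA, foundB = True, False
            else:
                foundA, foundB, foundC = False, False, False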

        if foundA and foundB and foundC:
            found = True
            break

    f.seek(0)
    i = 0
    while i < pos + offsetAfterC:
        b = f.read(1)
        i = i + 1
    while i < pos + offsetAfterC + lengthAfterC:
        b = f.read(1)
        print(hex(int.from_bytes(b, byteorder="big")))
        i = i + 1

Output:

0x58
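
If the file should be changed in place, as the question asks, a memory-mapped file is one option: `mmap` objects have a `find` method and support item assignment. The following is only a sketch under assumptions: `set_bit_after` is a hypothetical helper, the offset is counted from the start of the marker (like `i + 0x16` in the question), and the demo works on a temporary copy of the sample bytes rather than on a real build artifact.

```python
import mmap
import os
import tempfile


def set_bit_after(path, marker, offset, mask=1):
    """OR `mask` into the byte `offset` positions after the START of `marker`.

    Modifies the file in place and returns the index of the changed byte.
    Hypothetical helper for illustration.
    """
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        pos = mm.find(marker)  # -1 if the marker is absent
        if pos == -1:
            raise ValueError("marker not found")
        target = pos + offset
        if target >= len(mm):
            raise IndexError("target byte lies beyond the end of the file")
        mm[target] |= mask  # read-modify-write of a single byte
        mm.flush()
        return target


# Demo on a temporary file holding the same bytes as so.bin.
sample = bytes.fromhex("ffeedd414243ccbbaa589988770a")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
try:
    index = set_bit_after(tmp.name, b"ABC", 6)
    with open(tmp.name, "rb") as f:
        patched = f.read()
finally:
    os.remove(tmp.name)

print(index, hex(patched[index]))  # 9 0x59
```

Only the mapped byte is rewritten, so the rest of the file is never copied into a Python list or string.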

Thank you :-) but I was hoping more for a one-liner kind of code... I have a feeling that this is an overkill and that there is a much simpler solution. — Tar, Mar 21 '18 at 13:48
@Tar I've added a new solution. It is not a one-liner but it is nicer. — , Mar 21 '18 at 14:16
