I am trying to search for a very specific string in a folder full of binary files. The goal is to have the program open each binary file, search for the string, and then print the path of each file that contains it.

I think I have something that is close to working, but it is not there yet. I have been experimenting with how the search string is converted to bytes, but I am still not finding anything. I have also tried struct.unpack, but that didn't seem to work either.

Any help is much appreciated. Thank you for your time.

Code:

import os

toSearch = bytes("find me", "unicode_escape")
folderToSearch = "C:\\dir\\for\\bin\\files"
for root, dirs, files in os.walk(folderToSearch):
    for file in files:
        if file.endswith(".ROM"):
            with open(root+"\\"+file,"rb") as binary_file:
                fileContent = binary_file.read()
                if fileContent.find(toSearch) != -1:
                    print(os.path.join(root, file))
laxer
  • `string.find` returns -1 if the string is not found, so you need to test for that condition rather than the boolean truth. E.g. `if fileContent.find(toSearch) != -1:` – Tom Dalton Aug 28 '19 at 16:48
  • @TomDalton Thank you, that definitely solves my issue of it just printing everything, but it's still not quite right. It is not finding the string, which I suspect is because of the encodings. – laxer Aug 28 '19 at 16:51
  • So what is the encoding of the file? Have you looked at the raw bytes in the files to compare with what you're expecting? – Tom Dalton Aug 28 '19 at 17:12
  • @TomDalton It says it's Unicode – laxer Aug 28 '19 at 17:43
  • Unicode and character/binary encodings are a tricky subject and people often get confused. Have a read of https://stackoverflow.com/questions/643694/what-is-the-difference-between-utf-8-and-unicode as it should make the concepts a bit clearer. UTF-8 is pretty standard these days (at least for Western text, somewhat like ASCII was historically). – Tom Dalton Aug 29 '19 at 18:39
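
One way to test the encoding suspicion raised in these comments is to encode the search text in several ways and look for each resulting byte pattern. This is only a sketch under assumptions: the helper name `find_encodings` and the list of candidate encodings are hypothetical and not confirmed by the question. UTF-16 little-endian is included because Windows tools often label it simply as "Unicode".

```python
# Hypothetical helper; the candidate encodings are an assumption, adjust as needed.
CANDIDATE_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")


def find_encodings(data: bytes, text: str, encodings=CANDIDATE_ENCODINGS):
    """Return the encodings under which `text` occurs somewhere in `data`."""
    hits = []
    for enc in encodings:
        pattern = text.encode(enc)  # the byte sequence to look for
        if pattern in data:
            hits.append(enc)
    return hits


# Made-up sample data: the text stored as UTF-16-LE between arbitrary bytes.
sample = b"\x01\x02" + "find me".encode("utf-16-le") + b"\xff"
print(find_encodings(sample, "find me"))
```

If a file only matches under a UTF-16 encoding, searching for `b"find me"` will never succeed, because each character is stored as two bytes.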

2 Answers

This might help you do some debugging. (I also refactored your code to use pathlib instead of os to make it cleaner).

from pathlib import Path

encoding = "unicode_escape"
search_dir = Path("C:\\dir\\for\\bin\\files")
search_bytes = bytes("find me", encoding)
roms = {"match": [], "no_match": []}

for rom_file in search_dir.glob("**/*.ROM"):
    with open(rom_file, 'rb') as rom_handle:
        rom_contents = rom_handle.read()
        # Record each file's raw bytes under "match" or "no_match" for later inspection.
        match = "match" if (search_bytes in rom_contents) else "no_match"
        roms[match].append({
            str(rom_file.resolve()): rom_contents
        })

If you run this, you can manually inspect the bytes that are read in for matching/non-matching results.
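
Raw bytes objects can be hard to read when printed directly, so a small hex preview may help with the manual inspection. The following is a minimal sketch; `hex_preview` and its `limit` parameter are hypothetical names introduced for illustration only.

```python
def hex_preview(data: bytes, limit: int = 32) -> str:
    """Show the first `limit` bytes as hex, followed by printable ASCII."""
    chunk = data[:limit]
    hex_part = " ".join(f"{b:02x}" for b in chunk)
    # Non-printable bytes are shown as dots in the text column.
    text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    return f"{hex_part}  |{text_part}|"


# Example: how "find me" looks when stored as UTF-16-LE.
print(hex_preview("find me".encode("utf-16-le")))
```

A `00` byte after every character in the preview is a strong hint that the text is stored as UTF-16 rather than as single-byte text.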

PMende
  • When I try to run that it pops up with an error that says `ValueError: binary mode doesn't take an encoding argument` so I don't think it can be specified this way – laxer Aug 28 '19 at 17:40
  • @laxer Thanks for the comment. Should have tested it, first. :) I'll edit that out. – PMende Aug 28 '19 at 17:55
  • Unfortunately that didn't seem to find the value. – laxer Aug 28 '19 at 18:05
  • @laxer I would suggest manually creating a minimal binary file that you expect to be picked up, and seeing what the bytes actually look like. Then look at your real examples, and figure out what the difference is. – PMende Aug 28 '19 at 18:07

I'm not sure why using find() doesn't work, but the following does on my system:

import os

toSearch = b"find me"
folderToSearch = "C:\\dir\\for\\bin\\files"

for root, dirs, files in os.walk(folderToSearch):
    for file in files:
        if file.endswith(".ROM"):
            print(f'checking file {file}')
            filepath = os.path.join(root, file)
            with open(filepath, "rb") as binary_file:
                fileContent = binary_file.read()
                if toSearch in fileContent:
                    print(filepath)

print('done')
martineau
  • Unfortunately that didn't work. It still did not find the value. – laxer Aug 28 '19 at 17:43
  • @laxer Hmm, well, as I said, it does on my system using a simple test case I manually created more-or-less the way @PMende suggested that you do in a comment under his/her answer. – martineau Aug 28 '19 at 18:43
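
The kind of minimal test case mentioned in these comments could look roughly like the sketch below. It is not the test either answerer actually ran: the file names, the file contents, and the helper `search_roms` are all made up for illustration, and the files are written to a temporary directory rather than the real ROM folder.

```python
import tempfile
from pathlib import Path


def search_roms(folder, needle: bytes):
    """Return sorted paths of .ROM files under `folder` that contain `needle`."""
    return sorted(
        path for path in Path(folder).rglob("*.ROM")
        if needle in path.read_bytes()
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # One file that should match and one that should not (contents are invented).
    (tmp_path / "hit.ROM").write_bytes(b"\x00\x01find me\xff")
    (tmp_path / "miss.ROM").write_bytes(b"\x00\x01nothing here\xff")
    found_names = [p.name for p in search_roms(tmp_path, b"find me")]

print(found_names)
```

If this finds the test file but the real files still produce no match, the search logic itself is working, and the difference most likely lies in how the string is stored inside the real files.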