Basically I want a "multiline grep that takes binary strings as patterns".

For example:

printf '\x00\x01\n\x02\x03' > big.bin
printf '\x01\n\x02' > small.bin
printf '\x00\n\x02' > small2.bin

Then the following should hold:

  • small.bin is contained in big.bin
  • small2.bin is not contained in big.bin

I don't want to have to convert the files to ASCII hex representation with xxd as shown e.g. at: https://unix.stackexchange.com/questions/217936/equivalent-command-to-grep-binary-files because that feels wasteful.

Ideally, the tool should handle large files that don't fit into memory.

Note that the following attempts don't work.

grep -f matches where it shouldn't because it must be splitting newlines:

grep -F -f small.bin big.bin
# Correct: Binary file big.bin matches
grep -F -f small2.bin big.bin
# Wrong: Binary file big.bin matches

Shell substitution as in $(cat) fails because it is impossible to handle null characters in Bash AFAIK, so the string just gets truncated at the first 0 I believe:

grep -F "$(cat small.bin)" big.bin
# Correct: Binary file big.bin matches
grep -F "$(cat small2.bin)" big.bin
# Wrong: Binary file big.bin matches

A C question has been asked at "How can i check if binary file's content is found in other binary file?", but is it possible with any widely available CLI (hopefully POSIX, or GNU coreutils) tools?

Notably, implementing a non-naive algorithm such as Boyer-Moore is not entirely trivial.

I can hack up a working Python one liner as follows, but it won't work for files that don't fit into memory:

grepbin() ( python -c 'import sys;sys.exit(open(sys.argv[1],"rb").read() not in open(sys.argv[2],"rb").read())' "$1" "$2" )
grepbin small.bin big.bin && echo 1
grepbin small2.bin big.bin && echo 2

I could also find the following two tools on GitHub:

but they don't seem to support taking the pattern from a file; you provide the input as hex ASCII on the command line. I could use:

bgrep $(xxd -p small.bin | tr -d '\n') big.bin

since it does not matter as much if the small file gets converted with xxd, but it's not really nice.

In any case, if I were to implement the feature, I'd likely add it to the Rust library above.

bgrep is also mentioned at: How does bgrep work?

Tested on Ubuntu 20.10.

Ciro Santilli OurBigBook.com
  • `grep -f matches where it shouldn't because it must be splitting newlines:` and also it's parsing regex. `grep "$(cat small.bin)"` fails not only for zero bytes. `grep` expects a regex. Note that your `is it possible with any widely available CLI` is falling into "seeking recommendation of tools" bin. – KamilCuk Jan 11 '21 at 22:27
  • @KamilCuk true, added `-F` for non regex. If it closes, I post in other places, the usual procedure. – Ciro Santilli OurBigBook.com Jan 11 '21 at 22:40
  • `"$(cat file-with-nulls)"` is setting yourself up for failure, since NULs can't be stored in C strings, and all strings in bash are NUL-delimited C strings. For that matter, I'd be very surprised -- nay, astonished -- if `grep` used strings capable of containing NUL literals. – Charles Duffy Jan 11 '21 at 22:42
  • ...it'd be a more reasonable start to build a version of your Python code that did windowing to search through the file incrementally (obvs., needing some special-case handling around chunk boundaries; but none of that seems particularly tricky so long as a maximum size can be enforced). – Charles Duffy Jan 11 '21 at 22:45
  • `grep` loads the whole regex pattern into memory. I don't think there's a tool that searches a binary byte stream (not byte buffer...) within a binary file. If the small file fits in memory and the large file is huge, you can write a simple C program that reads the small file, uses its size as a block size, reads blocks from the large file, and tries to find the small file with `memmem()`. – root Jan 12 '21 at 07:59

1 Answer

How to check if a binary file is contained inside another binary from the Linux command line?

The very POSIX portable way would be to use od to convert to hex and then check for substring with grep, along with some sed scripting in between.

The more usual portable way would be to use xxd instead of od:

xxd -p small.bin | tr -d ' \n' > small.bin2
xxd -p big.bin | tr -d ' \n' > big.bin2
grep -F -f small.bin2 big.bin2

This works fine, tested in Docker on Alpine with BusyBox.
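
One caveat with matching on hex text: xxd -p emits two hex characters per byte, so a hit at an odd character offset straddles two bytes and is a false positive (bytes 12 34 appear to match inside 01 23 40). The following is a minimal Python sketch of a byte-aligned check, not part of the commands above; the helper name hex_contains is hypothetical and it works on in-memory bytes only.

```python
def hex_contains(small: bytes, big: bytes) -> bool:
    """Return True if `small` occurs in `big`, comparing hex text.

    Only matches that start at an even character offset are accepted,
    because each byte maps to exactly two hex characters.
    """
    needle, haystack = small.hex(), big.hex()
    pos = haystack.find(needle)
    while pos != -1:
        if pos % 2 == 0:  # match starts on a byte boundary
            return True
        # Odd offset: the match straddles two bytes, keep looking.
        pos = haystack.find(needle, pos + 1)
    return False
```

A plain grep -F on the hex dumps has no such alignment check, so it can report a match that does not exist at the byte level.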

But:

I don't want to have to convert the files to ASCII hex representation with xxd as shown

Then you can't work with binary files in shell. Pick another language. Shell was specifically created to parse nice-looking, human-readable strings; for anything else it's utterly unpleasant, and for files with zero bytes xxd is the first thing you type.

I can hack up a working Python one liner as follows,

awk is also POSIX and available everywhere. I believe someone more skilled in awk may come and write the exact 1:1 equivalent of your Python script, but:

but it won't work for files that don't fit into memory:

So write a different algorithm that does not do that.
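
One such algorithm, along the lines of the windowing idea from the comments, reads the large file in fixed-size chunks and carries the last len(pattern) - 1 bytes over to the next chunk so that a match spanning a chunk boundary is not missed. This is only a sketch under assumptions: the pattern file is assumed to fit in memory, and the function name file_contains and the 1 MiB default chunk_size are made up for illustration.

```python
def file_contains(needle_path, haystack_path, chunk_size=1 << 20):
    """Return True if the bytes of needle_path occur in haystack_path.

    Only the needle plus one chunk (and a short tail) are held in
    memory at any time.
    """
    with open(needle_path, "rb") as f:
        needle = f.read()
    if not needle:
        return True  # an empty pattern is contained in anything
    keep = len(needle) - 1  # bytes to carry over between chunks
    tail = b""
    with open(haystack_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False  # end of file, no match found
            window = tail + chunk
            if needle in window:
                return True
            # A match crossing the boundary needs at most keep bytes
            # from this window; window[-0:] would keep everything,
            # hence the guard for single-byte needles.
            tail = window[-keep:] if keep else b""
```

Called as file_contains("small.bin", "big.bin"), it should report the same results as the one-liner from the question, with the result mapped to an exit status by the caller if it is to be used from shell.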

Overall, given the constraint of not using xxd (or od) to convert a binary file with zero bytes to its hex representation:

is it possible with any widely available CLI (hopefully POSIX, or GNU coreutils) tools?

No. Write your own program for that. You may also write it in Perl; it's sometimes available on machines that don't have Python.
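
As an illustration of how small such a program can be, here is a sketch in Python that memory-maps the large file instead of reading it, leaving paging to the operating system. The assumptions are that the pattern file fits in memory and that the platform can map the large file (a 64-bit address space for very large inputs); the function name mmap_contains is hypothetical.

```python
import mmap
import os


def mmap_contains(needle_path, haystack_path):
    """Return True if the bytes of needle_path occur in haystack_path."""
    with open(needle_path, "rb") as f:
        needle = f.read()
    if not needle:
        return True  # an empty pattern is contained in anything
    with open(haystack_path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return False  # an empty file cannot be memory-mapped
        # Length 0 maps the whole file; pages are loaded on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle) != -1
```

The search itself is delegated to mmap.find, so no chunk-boundary handling is needed here.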

KamilCuk