Basically I want a "multiline grep that takes binary strings as patterns".
For example:
printf '\x00\x01\n\x02\x03' > big.bin
printf '\x01\n\x02' > small.bin
printf '\x00\n\x02' > small2.bin
Then the following should hold:
small.bin is contained in big.bin
small2.bin is not contained in big.bin
I don't want to have to convert the files to an ASCII hex representation with xxd, as shown e.g. at: https://unix.stackexchange.com/questions/217936/equivalent-command-to-grep-binary-files , because that feels wasteful.
Ideally, the tool should handle large files that don't fit into memory.
Note that the following attempts don't work.
grep -f matches where it shouldn't, presumably because it splits the pattern file at newlines and treats each line as a separate pattern:
grep -F -f small.bin big.bin
# Correct: Binary file big.bin matches
grep -F -f small2.bin big.bin
# Wrong: Binary file big.bin matches
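As a rough model of what I think is happening (this is my assumption about the behavior, not something taken from the grep source), the pattern file is cut at newlines and any single piece is enough for a match. The function name grep_f_model is made up for illustration:

```python
def grep_f_model(pattern_file_bytes: bytes, data: bytes) -> bool:
    """Simplified model of `grep -F -f`: each newline-separated piece of the
    pattern file is its own pattern, and any one of them matching counts.
    Edge cases (empty pattern files, trailing newlines) are ignored here."""
    patterns = pattern_file_bytes.split(b"\n")
    return any(p in data for p in patterns)


big = b"\x00\x01\n\x02\x03"
small = b"\x01\n\x02"
small2 = b"\x00\n\x02"

# Both report a match under this model, because small2 is split into
# b"\x00" and b"\x02", and each of those occurs somewhere in big.
print(grep_f_model(small, big), grep_f_model(small2, big))

# What I actually want is a plain substring test on the whole pattern.
print(small in big, small2 in big)
```

If this model is right, it explains the false positive: small2.bin never appears as a whole, but its individual lines do.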
Shell substitution as in $(cat) fails because, AFAIK, it is impossible to handle null characters in Bash, so I believe the string just gets truncated at the first 0:
grep -F "$(cat small.bin)" big.bin
# Correct: Binary file big.bin matches
grep -F "$(cat small2.bin)" big.bin
# Wrong: Binary file big.bin matches
A C question has been asked at: How can i check if binary file's content is found in other binary file? but is it possible with any widely available CLI (hopefully POSIX, or GNU coreutils) tools?
Notably, implementing a non-naive algorithm such as Boyer-Moore is not entirely trivial.
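To give an idea of what is involved, here is a minimal in-memory sketch of Horspool's simplification of Boyer-Moore (not the full algorithm, which also uses a good-suffix table). The name horspool_find is hypothetical, and the sketch assumes both byte strings already fit in memory:

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle in haystack,
    or -1 if there is none (Boyer-Moore-Horspool)."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Bad-character table: how far the window may slide when the byte
    # aligned with the last needle position is b. Bytes that do not occur
    # in needle[:-1] allow a full jump of m.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift[haystack[pos + m - 1]]
    return -1


big = b"\x00\x01\n\x02\x03"
print(horspool_find(big, b"\x01\n\x02"))  # 1
print(horspool_find(big, b"\x00\n\x02"))  # -1
```

Even this reduced version needs care around the shift table, and it still does nothing for files larger than memory.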
I can hack up a working Python one-liner as follows, but it won't work for files that don't fit into memory:
# Open both files in binary mode ("rb") so arbitrary bytes are compared as-is.
grepbin() ( python -c 'import sys; sys.exit(open(sys.argv[1], "rb").read() not in open(sys.argv[2], "rb").read())' "$1" "$2" )
grepbin small.bin big.bin && echo 1
grepbin small2.bin big.bin && echo 2
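The memory limit could in principle be avoided by reading the big file in chunks and carrying over the last len(pattern) - 1 bytes, so that a match straddling a chunk boundary is still seen. The following is only a sketch under the assumption that the pattern itself fits in memory; contains and chunk_size are names I made up:

```python
def contains(pattern_path: str, big_path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the bytes of pattern_path occur in big_path,
    reading big_path chunk_size bytes at a time."""
    with open(pattern_path, "rb") as f:
        pattern = f.read()
    if not pattern:
        return True  # the empty pattern is trivially contained
    overlap = len(pattern) - 1
    tail = b""
    with open(big_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if pattern in window:
                return True
            # Keep just enough trailing bytes to catch a match that
            # starts in this chunk and ends in the next one.
            tail = window[-overlap:] if overlap else b""
```

With the example files, contains("small.bin", "big.bin") should be true and contains("small2.bin", "big.bin") false. Memory use stays around chunk_size plus the pattern length, but I would still prefer an existing, widely available tool over maintaining this myself.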
I could also find the following two tools on GitHub:
https://github.com/tmbinc/bgrep in C, installable with (amazing :-)):
curl -L 'https://github.com/tmbinc/bgrep/raw/master/bgrep.c' | gcc -O2 -x c -o /usr/local/bin/bgrep -
https://github.com/gahag/bgrep in Rust, installable with:
cargo install bgrep
but they don't seem to support taking the pattern from a file; you have to provide the pattern as hex ASCII on the command line. I could use:
bgrep "$(xxd -p small.bin | tr -d '\n')" big.bin
since it does not matter as much if the small file gets converted with xxd, but it's not really nice.
In any case, if I were to implement the feature, I'd likely add it to the Rust library above.
bgrep is also mentioned at: How does bgrep work?
Tested on Ubuntu 20.10.