
The command I'm currently using to search for some hex values (say 0A 8b 02) is:

find . -type f -not -name "*.png" -exec xxd -p {} \; | grep "0a8b02" || xargs -0 -P 4

Is it possible to improve this given the following goals:

  • search files recursively
  • display the offset and filename
  • exclude certain files with certain extensions (above example will not search .png files)
  • speed: the search needs to handle 200,000 files (around 50KB to 1MB each) in a directory totaling ~2GB.

I'm not sure the xargs part is actually using 4 processes. I'm also having difficulty printing the filename when grep finds a match, since its input is piped from xxd. Any suggestions?

mklement0
Helen Che
  • I would write a script for grepping a single binary (which prints the filename on success), and use that script in `find | xargs`. You're in zsh so it's hard to define functions in a subshell. If you're determined to have everything in one script, you can use bash instead, which allows you to export a function. – 4ae1e1 May 15 '15 at 19:19
  • So given what I currently have... it's impossible to even output the filename? – Helen Che May 15 '15 at 21:01
  • There'd be a fairly simple solution if the search byte sequences _never_ included `0xa` (i.e., newlines) - but it sounds like they can, right? Also, are you using _GNU_ utilities (Linux)? – mklement0 May 15 '15 at 21:15
  • @mklement0 No, sequences will never include `0xa`, I'm running this on OS X unfortunately. Will this be a problem? – Helen Che May 15 '15 at 21:20
  • Unfortunately, probably yes. However, perhaps installing GNU `grep` is an option for you. Have a look at my answer and let's continue the discussion there. – mklement0 May 15 '15 at 21:30

1 Answer


IF:

  • you have GNU grep
  • AND the hex bytes you search for NEVER contain newlines (0xa)[1]
    • If they contain NUL (0x0), you must provide the grep search string via a file (-f) rather than by direct argument.

the following command would get you there, using the example of searching for 0e 8b 02:

LC_ALL=C find . -type f -not -name "*.png" -exec grep -FHoab $'\x0e\x8b\x02' {} + |
  LC_ALL=C cut -d: -f1-2

The grep command produces output lines as follows:

<filename>:<byte-offset>:<matched-bytes>

which LC_ALL=C cut -d: -f1-2 then reduces to <filename>:<byte-offset>
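
As a quick sanity check, the pipeline can be tried on a throwaway file. This is only a minimal sketch: it assumes GNU grep is on the PATH as `grep` (Homebrew installs it as `ggrep`), the function name `demo_offsets` and the sample file are made up for illustration, and `printf` with octal escapes stands in for `$'\x0e\x8b\x02'` so that the snippet also runs in a plain `sh`.

```shell
# Hypothetical demo: build a small binary file and report match offsets.
demo_offsets() {
  dir=$(mktemp -d) || return 1
  # Layout: "ab", then 0e 8b 02, then "cd", then 0e 8b 02 again.
  printf 'ab\016\213\002cd\016\213\002' > "$dir/sample.bin"
  pattern=$(printf '\016\213\002')   # same bytes as $'\x0e\x8b\x02'
  ( cd "$dir" &&
    LC_ALL=C grep -FHoab "$pattern" sample.bin | LC_ALL=C cut -d: -f1-2 )
  rm -rf "$dir"
}

demo_offsets
```

With GNU grep this should print `sample.bin:2` and `sample.bin:7`, one per line, i.e. the 0-based byte offsets of the two occurrences.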

The command almost works with BSD grep, except that the byte offset reported is invariably the start of the line that the pattern was matched on.
In other words: the byte offset will only be correct if no newlines precede a match in the file.
Also, BSD grep doesn't support specifying NUL (0x0) bytes as part of the search string, not even when provided via a file with -f.
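
If installing GNU grep is not an option, one possible workaround (not benchmarked, and certainly slower than a direct grep) is to turn each file into a single string of hex digits and search that string with awk. The sketch below uses only `od`, `tr` and `awk`; the function name `hexfind` is hypothetical. Because the search runs on hex digits, the byte sequence may contain 0a or 00.

```shell
# Hypothetical helper: hexfind <hex-digits> <file>...
# Prints "<file>:<byte-offset>" for each occurrence of the byte sequence.
hexfind() {
  pat=$(printf '%s' "$1" | tr 'A-F' 'a-f')   # normalize to lowercase hex
  shift
  for f in "$@"; do
    # Dump the file as one long string of hex digits (two per byte).
    od -An -v -tx1 "$f" | tr -d ' \n' |
      awk -v pat="$pat" '{
        s = $0; base = 0
        while ((i = index(s, pat)) > 0) {
          pos = base + i - 1               # 0-based position in hex digits
          if (pos % 2 == 0) print pos / 2  # keep byte-aligned hits only
          s = substr(s, i + 1); base += i
        }
      }' |
      while read -r off; do printf '%s:%s\n' "$f" "$off"; done
  done
}
```

Saved as a script whose last line is `hexfind "$@"`, it could be driven by find in the same way as the grep command, for example `find . -type f -not -name "*.png" -exec ./hexfind.sh 0e8b02 {} +` (the script name is an assumption).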

  • Note that there'll be no parallel processing, but only a few grep invocations, based on using find's -exec ... +, which, like xargs, passes as many filenames as will fit on a command line to grep at once.
  • By letting grep search for the byte sequence directly, there is no need for xxd:
    • The sequence is specified as an ANSI C-quoted string ($'...'), which means that the shell expands the escape sequences to literal bytes, enabling Grep to then search for the resulting string as a literal (via -F), which is faster.
      ANSI C-quoted strings are documented in the bash manual, but they work in zsh (and ksh) too.
      • A GNU Grep alternative is to use -P (support for PCREs, Perl-compatible regular expressions) with non-pre-expanded escape sequences, but this will be slower: grep -PHoab '\x{0e}\x{8b}\x{02}'
    • LC_ALL=C ensures that grep treats each byte as its own character without applying any encoding rules.
    • -F treats the search string as a literal (rather than a regex)
    • -H prepends the relevant input filename to each output line; note that Grep does this implicitly when given more than 1 filename argument
    • -o reports only the matched strings (byte sequences), not the whole line (the concept of a line has no meaning in binary files anyway)[2]
    • -a treats binary files as if they were text files (without this, Grep would only print the text "Binary file <filename> matches" for binary input files with matches)
    • -b reports the byte offsets of matches

If it's sufficient to find at most 1 match in a given input file, add -m 1.
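
Since the question asked about spreading the work over 4 processes, here is a sketch of a parallel variant. It assumes GNU grep plus `find -print0` and `xargs -0 -P`, which both GNU and BSD userlands provide but POSIX does not. The function name `search_hex_parallel` and the batch size of 500 are illustrative choices, and whether this is actually faster depends on whether the disk or the CPU is the bottleneck.

```shell
# Hypothetical helper: search_hex_parallel <dir> <byte-sequence>
# Prints "<file>:<byte-offset>" lines; the order of lines is not deterministic.
search_hex_parallel() {
  LC_ALL=C find "$1" -type f ! -name '*.png' -print0 |
    # Up to 4 grep processes at a time, 500 files per invocation.
    # --line-buffered keeps each output line in a single write, so that
    # lines from concurrent greps are less likely to interleave.
    LC_ALL=C xargs -0 -P 4 -n 500 grep -FHoab --line-buffered -e "$2" |
    LC_ALL=C cut -d: -f1-2
}

# Example call; $'\x0e\x8b\x02' could replace the printf in bash, zsh or ksh:
# search_hex_parallel . "$(printf '\016\213\002')"
```

-H matters here because the last batch may contain a single file, in which case grep would otherwise omit the filename.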


[1] Newlines cannot be used, because Grep invariably treats newlines in a search-pattern string as separating multiple search patterns. Also, Grep is line-based, so you can't match across lines; GNU Grep's --null-data option to split the input by NUL bytes could help, but only if your search byte sequence doesn't also comprise NUL bytes; you'd also have to represent your byte values as escape sequences in a regex combined with -P - because you'll need to use escape sequence \n in lieu of actual newlines.

[2] -o is needed to make -b report the byte offset of the match as opposed to that of the beginning of the line (as stated, BSD Grep always does the latter, unfortunately); additionally, it is beneficial to only report the matches themselves here, as an attempt to print the entire line would result in unpredictably long output lines, given that there's no concept of lines in binary files; either way, however, outputting bytes from a binary file may cause strange rendering behavior in the terminal.

mklement0
  • I followed [this blog](https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/) on installing GNU `grep`, but I'm still having issues: a simple [`ggrep -P`](http://www.commandlinefu.com/commands/view/5959/grep-binary-hexadecimal-patterns) works, however I couldn't get your command to return anything from a simple search for `00`. I'm open to *any* solution using BSD `grep` if that's possible for you. It doesn't have to be one line, a script works fine for me. – Helen Che May 15 '15 at 23:04
  • Just to note homebrew installs gnu grep as `ggrep`. Running `ggrep --version` gives `(GNU grep) 2.21`. Is it also possible to display n bytes ahead/behind of the matched offset? – Helen Che May 15 '15 at 23:09
  • @VeraWang: It's a limitation of the _shell_ that you can't pass values that include NUL (`0x0`) as an _argument_. With _GNU_ Grep you can work around that by saving the byte sequence to a _file_ and then using that file with `-f`. Sadly, this technique does _not_ work with _BSD_ Grep. – mklement0 May 15 '15 at 23:30
  • @VeraWang: Re displaying n bytes of context: from what I can tell, the context features invariably relate to _lines of text_, not bytes. – mklement0 May 15 '15 at 23:32
  • @VeraWang: Here's an example of how to generate a single NUL, which you can redirect to a file: `dd if=/dev/zero bs=1 count=1 2>/dev/null`. – mklement0 May 15 '15 at 23:34
  • Tangential question: is there a cleaner way of excluding multiple file extensions? I'm resorting to `! -name "*.ext"` for each one. – Helen Che May 16 '15 at 08:46
  • @VeraWang: You can use the `-regex` or `-iregex` primaries, which match the entire path against a regular expression (case-insensitively in the latter case). If you combine that with `-E` to enable _extended_ regular expressions, you can use alternation (`|`), such as in the following example, which excludes both `.txt` and `.bak` files: `find -E . ! -iregex '.*\.(bak|txt)$'` – mklement0 May 16 '15 at 12:56
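
For find implementations that lack `-E` (GNU find, for instance), a portable way to exclude several extensions is to group the `-name` tests with parentheses and negate the group once. A minimal sketch, with a hypothetical function name and an arbitrary list of extensions:

```shell
# Hypothetical helper: list regular files under <dir>, skipping several extensions.
list_candidates() {
  find "$1" -type f ! \( -name '*.png' -o -name '*.bak' -o -name '*.txt' \)
}
```

The parentheses must be escaped (or quoted) so that the shell passes them through to find.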