
I want to get the byte offset of a string pattern in a binary file on an embedded Linux platform.

If I could use the "grep -b" option, that would be the best way, but it is not supported on my machine.

The machine does not support this command:

ADDR=`grep -oba <pattern string> <file path> | cut -d ":" -f1`

Here is the help output of the grep command on the machine.

root# grep --help

BusyBox v1.29.3 () multi-call binary.

Usage: grep [-HhnlLoqvsriwFE] [-m N] [-A/B/C N] PATTERN/-e PATTERN.../-f FILE [FILE]...

Search for PATTERN in FILEs (or stdin)

        -H      Add 'filename:' prefix
        -h      Do not add 'filename:' prefix
        -n      Add 'line_no:' prefix
        -l      Show only names of files that match
        -L      Show only names of files that don't match
        -c      Show only count of matching lines
        -o      Show only the matching part of line
        -q      Quiet. Return 0 if PATTERN is found, 1 otherwise
        -v      Select non-matching lines
        -s      Suppress open and read errors
        -r      Recurse
        -i      Ignore case
        -w      Match whole words only
        -x      Match whole lines only
        -F      PATTERN is a literal (not regexp)
        -E      PATTERN is an extended regexp
        -m N    Match up to N times per file
        -A N    Print N lines of trailing context
        -B N    Print N lines of leading context
        -C N    Same as '-A N -B N'
        -e PTRN Pattern to match
        -f FILE Read pattern from file

Since that option isn't available, I'm looking for an alternative.

A combination of hexdump and grep can also be useful, such as:

ADDR=`hexdump <file path> -C | grep <pattern string> | cut -d' ' -f1`

But if the pattern spans multiple lines of the hexdump output, it will not be found.

Is there a way to find the byte offset of a specific pattern with a Linux command?

1q2w3e
  • As an aside, probably [avoid upper case for the names of your private variables](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization) and prefer modern `$(command substitution)` syntax over legacy backticks. – tripleee Nov 11 '22 at 08:50

2 Answers


Something like this?

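hexdump -C "$file" |
awk -v pattern="$pattern" '
# Previous line ended with a prefix of the pattern: does this line start with the rest?
residue { matched = ($0 ~ "\\|" residue); residue = ""
  if (matched) { print $1; next } }
# The whole pattern sits within one line.
$0 ~ pattern { print $1 }
# Line ends with a proper prefix of the pattern: remember the remaining tail.
{ for (i = length(pattern) - 1; i > 0; i--)
    if ($0 ~ substr(pattern, 1, i) "\\|$") { residue = substr(pattern, i + 1); break } }'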

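The offset printed is just the first field of the hexdump output, that is, the address of the 16-byte line containing the match (or, when the match wraps, of the line where it ends). If you need the precise location of the match, this requires some additional massaging to figure out the offset to add to the address, or to subtract if the match was wrapped.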

Briefly tested in a clean-slate Busybox Docker container where hexdump -C output looks like this:

/ # hexdump -C /etc/resolv.conf 
00000000  23 20 44 4e 53 20 72 65  71 75 65 73 74 73 20 61  |# DNS requests a|
00000010  72 65 20 66 6f 72 77 61  72 64 65 64 20 74 6f 20  |re forwarded to |
00000020  74 68 65 20 68 6f 73 74  2e 20 44 48 43 50 20 44  |the host. DHCP D|
00000030  4e 53 20 6f 70 74 69 6f  6e 73 20 61 72 65 20 69  |NS options are i|
00000040  67 6e 6f 72 65 64 2e 0a  6e 61 6d 65 73 65 72 76  |gnored..nameserv|
00000050  65 72 20 31 39 32 2e 31  36 38 2e 36 35 2e 35 0a  |er 192.168.65.5.|
00000060  20                                                | |
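
One way to get the exact byte offset, rather than the address of a 16-byte line, is to drop the line structure altogether: dump the file as one continuous hex string and search it for the hex form of the pattern. The following is a minimal sketch under stated assumptions, not part of the snippet above. `find_offset` is a hypothetical helper name, the pattern is taken as a literal string without null bytes, your hexdump must accept the `-v` and `-e` options (check `hexdump --help` on the target), and awk has to hold the whole hex dump, twice the file size, in memory.

```shell
# Hypothetical helper: print the 0-based byte offset of the first
# occurrence of the literal string $1 in file $2; exit status 1 if absent.
find_offset() {
    # Hex form of the pattern, two digits per byte, no separators.
    pattern_hex=$(printf '%s' "$1" | hexdump -v -e '1/1 "%02x"')
    # Hex form of the file as one long line (-v disables duplicate folding).
    hexdump -v -e '1/1 "%02x"' "$2" |
    awk -v pat="$pattern_hex" '
    {
        s = $0; base = 0
        while ((i = index(s, pat)) > 0) {
            # base + i is the 1-based position in the hex string; only odd
            # positions start on a byte boundary.
            if ((base + i) % 2 == 1) { print (base + i - 1) / 2; found = 1; exit }
            # Misaligned hit (second nibble of a byte): keep searching.
            s = substr(s, i + 1); base += i
        }
    }
    END { exit !found }'
}
```

For the resolv.conf dump shown above, `find_offset nameserver /etc/resolv.conf` should print 72 (0x48). The alignment check matters because a hex string can contain the pattern's digits starting in the middle of a byte.
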
tripleee

Set the pattern as the record separator in awk. The offset of the first occurrence is then the length of the first record. BusyBox awk treats RS as an extended regular expression, so add a backslash before any of the characters . [ ] \ * + ? ^ $ ( ) { } | in the pattern string.

<myfile.bin awk -v RS='pattern' '{print length($0); exit}'

If the pattern contains a null byte, you need a little extra work. Use tr to exchange null bytes with some byte value that doesn't appear in the pattern. For example, if the pattern's hex dump is 00002d41 (two null bytes followed by -A):

<myfile.bin tr '\0!' '!\0' | awk -v RS='!!-A' '{print length($0); exit}'
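
As a quick sanity check of the swap, here is a small worked example on a throwaway file. The separator `!!-A` used above corresponds to the bytes 00 00 2d 41 once the null bytes have been exchanged with `!`. The sample contents and the variable names `sample` and `offset` are made up for illustration.

```shell
# Throwaway sample: 'xy', then the pattern 00 00 2d 41, then 'z'.
sample=$(mktemp)
printf 'xy\000\000-Az' > "$sample"

# After tr, the stream reads 'xy!!-Az', so the first record is 'xy'.
offset=$(tr '\0!' '!\0' < "$sample" |
    LC_ALL=C awk -v RS='!!-A' '{print length($0); exit}')
echo "$offset"   # prints 2
rm -f "$sample"
```

Because tr exchanges the two byte values rather than deleting anything, the stream keeps its length and the offset carries over to the original file unchanged.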

If the pattern is not found, this prints the length of the whole file. So if you aren't sure whether the pattern is present, you again need some extra work. Append some text that can't be part of a pattern match to the input, so that you know that if there's a match, it won't be at the very end. Then, if the pattern is present, the input contains at least two records. But if the pattern is not present, the input contains only the first record (without a record separator after it).

{ cat myfile.bin; echo garbage; } |
LC_ALL=C awk -v RS='pattern' '
    NR==1 {n = length($0)}
    NR==2 {print n; found = 1; exit}
    END {exit !found}
'

LC_ALL=C forces awk to use a single-byte locale with no invalid characters. It's necessary if your ambient locale is multibyte and your awk implementation is locale-aware (e.g. mawk or gawk), and harmless otherwise (e.g. with BusyBox).
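
Putting the pieces together, the sketch below wraps the escaping step and the not-found check into one function. It is an illustration under assumptions, not a tested BusyBox recipe: `first_offset` and the `RS_RE` environment variable are hypothetical names, the pattern is a literal string without null bytes, and the separator is handed to awk through `ENVIRON` instead of `-v`, because `-v` assignments go through string escape processing and awk implementations differ in what they do with a sequence such as `\.`.

```shell
# Hypothetical helper: print the 0-based byte offset of the first
# occurrence of the literal string $1 in file $2; exit status 1 if absent.
first_offset() {
    # Backslash-escape ERE metacharacters so the pattern matches literally.
    re=$(printf '%s' "$1" | sed 's/[][\.*+?^$(){}|]/\\&/g')
    # Trailing garbage guarantees a second record whenever there is a match.
    { cat "$2"; echo garbage; } |
    RS_RE="$re" LC_ALL=C awk '
        BEGIN {RS = ENVIRON["RS_RE"]}
        NR==1 {n = length($0)}
        NR==2 {print n; found = 1; exit}
        END {exit !found}
    '
}
```

Usage would look like `first_offset 'c.d' myfile.bin`, which should match only a literal `c.d` and not, say, `cxd`. As before, the appended text must not be able to take part in a match, so change `garbage` if your pattern could overlap it.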

Gilles 'SO- stop being evil'
  • Unfortunately, this method doesn't work for me: it gives me an offset that is far less than the one given by `grep -oba`... perhaps that `awk` forgets to count each `\n` or `\r` in the binary input file? At least there is something with binary & `awk` processing it... – Anthony O. Jul 10 '23 at 15:36
  • @AnthonyO. Maybe your implementation of awk (which one is it anyway?) doesn't process null bytes? Make sure to replace them by some other byte that isn't in the pattern. Also, make sure to run in the C locale (`LC_ALL=C awk …`) if your awk supports multibyte encodings. – Gilles 'SO- stop being evil' Jul 10 '23 at 21:27
  • Wow thank you, if I specify `LC_ALL=C awk ...` I obtain the same result as `grep -bao`! I hope that it will work that way with the BusyBox implementation that I will use for my real usage – Anthony O. Jul 11 '23 at 07:58
  • @AnthonyO. It's not necessary for BusyBox since its awk doesn't handle UTF-8. But I've added a note to my answer for the sake of other implementations. – Gilles 'SO- stop being evil' Jul 11 '23 at 10:10