
I have a large binary file. I want to extract certain strings from it and copy them to a new text file.

For example, in:

D-wM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM-FM MM-[o@^B^@^@^@^@^@E7cacscKLrrok9bwC3Z64NTnZM-^G

I want to take the number '7' (after the @^@^@E) and every character after it, stopping at the Z (ignoring the M-^G).

I want to copy this 7cacscKLrrok9bwC3Z64NTnZ to a new file.

There will be multiple such strings in one file. The end will always be denoted by the M- (which I don't want copied). The start will always be denoted by a 7 (which I do want copied).

Unfortunately, my knowledge of grep, sed, etc, does not extend to this level. Can someone please suggest a viable way to achieve this?

cat -v filename | grep [7][A-Z,a-z] will show all strings with a '7' followed by a letter but that's not much.

Thank you.


I've noticed that my requirements are rather more complicated.

(I've performed the correct - I hope - formatting this time). Thanks to 'tshiono' for his (?) answer to the earlier submission.

I want to check the ending of a string and, if it ends in M-, grep another string that follows it (with junk in between). If the string does not end in M-, then I don't want it copied (let alone any other strings).

So what I would like is:

grep -a -Po "7[[:alnum:]]+(?=M-)" file_name and if the ending is M- then grep -a -Po "5x[[:alnum:]]+(?=\^)" file_name to copy the string that starts with 5x and ends with a ^.

In this example:

D-wM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM-FM MM-[o@^B^@^@^@^@^@E7cacscKLrrok9bwC3Z64NTnZM-^GwM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe

The outcome would be:

7cacscKLrrok9bwC3Z64NTnZ
5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk

However, if the ending is not M- (more precisely, if the ending is ^S), then do not try the second grep and do not record anything at all.

In this example:

D-wM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM-FM MM-[o@^B^@^@^@^@^@E7cacscKLrrok9bwC3Z64NTnZ^SGwM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe

The outcome would be null (nothing copied) as the 7cacs... string ends in ^S.

Is grep the correct tool? Grep a file and if the condition in the grep command is 'yes' then issue a different grep command but if the condition is 'no' then do nothing.

Thanks again.


I have noticed one additional modification.

Can one add an OR command to the second part? Grep if the second string starts with 5x OR 6x?

In the example below, grep -aPo "7[[:alnum:]]+M-.*?5x[[:alnum:]]+\^" filename | grep -aPo "7[[:alnum:]]+(?=M-)|5x[[:alnum:]]+(?=\^)" will extract the strings starting with 7 and the strings starting with 5x.

How can one change the 5x to 5x or 6x?

D-wM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM-FM MM-[o@^B^@^@^@^@^@E7cacscKLrrok9bwC3Z64NTnZM-^GwM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
D-wM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM-FM MM-[o@^B^@^@^@^@^@E7AAAAAscKLrrok9bwC3Z64NTnZM-^GwM-^?^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@M-lM-FM-MM-[o@^B^@M-lM6x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe

In this example, the desired outcome would be:

7cacscKLrrok9bwC3Z64NTnZ
5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk
7AAAAAscKLrrok9bwC3Z64NTnZ
6x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk

UPDATE MARCH 09:

I need to create a series of complex grep (or perl) commands to extract strings from a series of binary files.

I need two strings from the binary file.

The first string will always start with a 1.

The first string will end with a letter or number. The next letter will always be a lower case k. I do not want this k character.

The difficulty is that the ending k will not always be the first k in the string. It might be the first k but it might not.

After the k, there is a second string. The second string will always start with an A or a B.

The ending of the second string will take one of two forms:
a) a space, then the first three characters of the first string (after the leading 1) in lower case, followed by a closing parenthesis ")"; or
b) a ^K, then the first three characters of the first string (after the leading 1) in lower case.

For example:

1pppsx9YPar8Rvs75tJYWZq3eo8PgwbckB4m4zT7Yg042KIDYUE82e893hY ppp)

Should be:

1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc and B4m4zT7Yg042KIDYUE82e893hY - delete the k and the space then ppp.

For example:

1zzzsx9YPkr8Rvs75tJYWZq3eo8PgwbckA2m4zT7Yg042KIDYUE82e893hY^Kzzz

Should be:

1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc and A2m4zT7Yg042KIDYUE82e893hY - delete the second k and the ^Kzzz.

In the second example, we see that the first k is part of the first string. It is the k before the A that breaks up the first and second strings.

I hope there is a super grep expert who can help! Many thanks!

twinkette

3 Answers


If your grep supports the -P option, would you please try:

grep -a -Po "7[[:alnum:]]+(?=M-)" file
  • The -a option forces grep to read the input as a text file.
  • The -P option enables the perl-compatible regex.
  • The -o option tells grep to print only the matched substring(s).
  • The pattern (?=M-) is a zero-width lookahead assertion (introduced in Perl): it requires M- to follow the match but does not include M- in the result.
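
To see what the lookahead changes, here is a minimal sketch. It assumes GNU grep built with -P support, and it assumes the input is the literal text that cat -v displays (so M- is two ordinary characters). The variable names and the shortened sample line are only for illustration.

```shell
# Sample data: a shortened copy of the question's line, written as the
# literal text shown by `cat -v` (assumption: "M-" is plain text here).
sample='^@^@^@E7cacscKLrrok9bwC3Z64NTnZM-^G'

# -a: treat input as text, -P: Perl-compatible regex, -o: print only the match.
# (?=M-) requires "M-" to follow but leaves it out of the printed match.
result=$(printf '%s\n' "$sample" | grep -a -Po '7[[:alnum:]]+(?=M-)')
printf '%s\n' "$result"        # 7cacscKLrrok9bwC3Z64NTnZ

# The same pattern without the lookahead keeps the unwanted "M-".
with_marker=$(printf '%s\n' "$sample" | grep -a -Po '7[[:alnum:]]+M-')
printf '%s\n' "$with_marker"   # 7cacscKLrrok9bwC3Z64NTnZM-
```

Appending "> output.txt" to the first command would write the extracted strings to a new file, as the question asks.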

Alternatively, you can do the same with sed:

sed 's/M-/\n/g' file | sed -n 's/.*\(7[[:alnum:]]\+\).*/\1/p'
  • The first sed command splits the input file into multiple lines by replacing the substring M- with a newline. It has two benefits: it breaks the lines to allow multiple matches with sed and excludes the unnecessary portion M- from the input.
  • The next sed command extracts the desired pattern from the input.

It assumes your sed accepts \n in the replacement, which is a GNU extension (not POSIX compliant). Otherwise please try (if you are working in bash):

sed 's/M-/\'$'\n''/g' file | sed -n 's/.*\(7[[:alnum:]]\+\).*/\1/p'

[UPDATE]
(The requirement has been updated by the OP and the following are solutions according to it.)

Let me assume the string which starts with 7 and ends with M- is always followed by another (no more and no less than one) string which starts with 5x and ends with ^ (ASCII caret character), with junk in between.
Then would you please try the following:

grep -aPo "7[[:alnum:]]+M-.*?5x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|5x[[:alnum:]]+(?=\^)"
  • It executes the task in two steps (two cascaded greps).
  • The 1st grep narrows down the input data to the candidate substring, which includes the two desired sequences and the junk in between.
  • The regex .*? in between matches any (ASCII or binary) characters except a newline. The trailing ? makes the match lazy (shortest possible), so it does not overrun the way a greedy match would. It is intended to match the junk in between.
  • The 2nd grep combines two regexes with a pipe |, meaning logical OR, and extracts the two desired sequences.
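
The two-step behaviour can be checked with a small sketch. It assumes GNU grep with -P support and literal cat -v style text. The helper name extract_pair and the two shortened sample lines are hypothetical, modelled on the question's examples.

```shell
# Two shortened sample lines in `cat -v` notation (assumed test data):
# the first 7-string ends in "M-", the second ends in "^S".
accepted='^@E7cacscKLrrok9bwC3Z64NTnZM-^GwM-^?^@M-lM5x8w09qewqlk^89038432nowefe'
rejected='^@E7cacscKLrrok9bwC3Z64NTnZ^SGwM-^?^@M-lM5x8w09qewqlk^89038432nowefe'

extract_pair() {
    # Stage 1 keeps only "7...M-<junk>5x...^"; stage 2 splits out both strings.
    grep -aPo '7[[:alnum:]]+M-.*?5x[[:alnum:]]+\^' |
        grep -aPo '7[[:alnum:]]+(?=M-)|5x[[:alnum:]]+(?=\^)'
}

pair=$(printf '%s\n' "$accepted" | extract_pair)
# grep exits non-zero when nothing matches, hence the "|| true".
none=$(printf '%s\n' "$rejected" | extract_pair || true)

printf 'accepted:\n%s\n' "$pair"
printf 'rejected: [%s]\n' "$none"
```

Because the first stage only matches when M- directly follows the 7-string, the line ending in ^S yields nothing at all, which is the "if yes then grep again, otherwise do nothing" behaviour asked for.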

A potential problem with the grep solution is that grep is a line-oriented command and cannot include a newline character in the matched string. If the junk in between contains a newline character (I'm not sure whether that is possible), the above solution will fail. As a workaround, perl provides more flexible handling of binary data.

perl -0777 -ne '
    while (/(7[[:alnum:]]+)M-.*?(5x[[:alnum:]]+)\^/sg) {
        printf("%s\n%s\n", $1, $2);
    }
' file
  • The regex is mostly the same as that of grep because the -P option of grep means perl-compatible.
  • It can capture multiple patterns at once in variables $1 and $2 hence just one regex is enough.
  • The -0777 option to the perl command tells perl to slurp all data at once.
  • The s option at the end of the regex makes a dot match a newline character.
  • The g option enables the global (multiple) match.

[UPDATE2]
In order to make the regex match either 5x or 6x, replace 5x with (5|6)x.
Namely:

grep -aPo "7[[:alnum:]]+M-.*?(5|6)x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|(5|6)x[[:alnum:]]+(?=\^)"

As mentioned before, the pipe | means OR. The OR operator has the lowest priority in the evaluation, so the alternatives need to be enclosed in parentheses in this case.

If there is a possibility that a digit other than 5 or 6 may appear, it will be safer to put [[:digit:]] instead, which matches any one digit between 0 and 9:

grep -aPo "7[[:alnum:]]+M-.*?[[:digit:]]x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|[[:digit:]]x[[:alnum:]]+(?=\^)"
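
A quick check of the alternation on two shortened sample lines (assumed test data in cat -v notation, GNU grep with -P assumed). The second variant uses a bracket expression [56], which is an equivalent way to write "5 or 6" when each alternative is a single character; it is shown only as a comparison, not as part of the original answer.

```shell
lines=$(printf '%s\n' \
    '^@E7cacscKLrrok9bwC3Z64NTnZM-^GwM-lM5x8w09qewqlk^890' \
    '^@E7AAAAAscKLrrok9bwC3Z64NTnZM-^GwM-lM6x8w09qewqlk^890')

# Alternation inside parentheses, as in the answer.
with_group=$(printf '%s\n' "$lines" |
    grep -aPo '7[[:alnum:]]+M-.*?(5|6)x[[:alnum:]]+\^' |
    grep -aPo '7[[:alnum:]]+(?=M-)|(5|6)x[[:alnum:]]+(?=\^)')

# Bracket expression: matches exactly one character out of the set {5, 6}.
with_class=$(printf '%s\n' "$lines" |
    grep -aPo '7[[:alnum:]]+M-.*?[56]x[[:alnum:]]+\^' |
    grep -aPo '7[[:alnum:]]+(?=M-)|[56]x[[:alnum:]]+(?=\^)')

printf '%s\n' "$with_group"
```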

[UPDATE3]
(Answering the OP's requirement on March 9th)

Let me start with perl code, whose regex is relatively easy to explain.

perl -0777 -ne 'while (/(1(.{3}).+)k([AB].*)[\013 ]\2/g){print "$1 $3\n"}' file

Output:

1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc B4m4zT7Yg042KIDYUE82e893hY
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc A2m4zT7Yg042KIDYUE82e893hY

[Explanation of regex]

(1(.{3}).+)k([AB].*)[\013 ]\2
(                  start of the 1st capture group referred by $1 later
 1                 literal "1"
  (                start of the 2nd capture group referred by \2 later
   .{3}            any three characters (ppp or zzz in the examples)
       )           end of the 2nd capture group
        .+         followed by any characters with "greedy" match which may include the 1st "k"
          )        end of the 1st capture group
           k       literal "k"
(                  start of the 3rd capture group referred by $3 later
 [AB].*            the character "A" or "B" followed by any characters
       )           end of the 3rd capture group
        [\013 ]    followed by ^K (octal 013) or a space
               \2  followed by the capture group 2 previously assigned

When implementing it with grep, we encounter a limitation. Although we want to extract multiple patterns from the input file, the -e option (which can specify multiple search patterns) does not work with the -P option. We therefore need to split the regex into two patterns such as:

grep -Po "(1(.{3}).+)(?=k([AB].*)[\013 ]\2)" file
grep -Po "(1(.{3}).+)k\K([AB].*)(?=[\013 ]\2)" file

And the result will be:

1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc
B4m4zT7Yg042KIDYUE82e893hY
A2m4zT7Yg042KIDYUE82e893hY

Please note that the order of the output is not the same as the order of appearance in the original file.

Another option is to use ripgrep (rg), which is a fast and versatile version of grep. You may need to install ripgrep with sudo apt install ripgrep or another package manager. An advantage of ripgrep is that it supports the -r (replace) option, in which you can make use of backreferences:

rg -N -Po "(1(.{3}).+)k([AB].*)[\013 ]\2" -r '$1 $3' file

The -r '$1 $3' option prints the 1st and the 3rd capture groups and the result will be the same as perl.

tshiono
  • @twinkette Good to know it's working. If you feel comfortable with the answer, I'd be happy if you can accept it by clicking the check mark besides the answer. BR. – tshiono Feb 03 '20 at 01:54
  • I have actually edited the question rather substantially since I actually need something even more complicated. I don't know if checking your answer will "close" the question which I wouldn't want to do just yet. But I will certainly check your response. @tshiono – twinkette Feb 03 '20 at 20:15
  • Thank you for updating your question. Well understood. I've updated my answer accordingly. Would you please test it? I suppose it is common that an OP finds another requirement after testing the initial answer. Please do not hesitate to update your post. BR. – tshiono Feb 04 '20 at 00:09
  • I have made yet another modification @tshiono if you (or anyone) would like to take a look? – twinkette Feb 05 '20 at 12:20
  • Sure. Very interested :) – tshiono Feb 05 '20 at 12:27
  • Perfect! That's wonderful. I do appreciate your hard work and knowledge of the grep command. – twinkette Feb 07 '20 at 23:02
  • @RavinderSingh13 IMHO the combination of `grep` with `-Po` option *and* `regex` with `lookahead/lookbehind assertion` is very useful in extracting substrings out of data. – tshiono Feb 12 '20 at 05:57
  • @tshiono, yeah you are right, I was searching that part and learning and seeing yours and anubhava's lot of posts :) learning is in progress caption :) :) – RavinderSingh13 Feb 12 '20 at 06:06
  • @tshiono, Hi toshiono, I am sorry if I am bugging you here, could you please explain this statement `The s option at the end the regex makes a dot match a newline character` more please? Is it like sed's substitution s kind of? – RavinderSingh13 Feb 12 '20 at 06:14
  • @RavinderSingh13 sorry my explanation was not clear enough. In `Perl`, as in most regex flavors, the metacharacter `.` matches any single character except a newline "\n" by default. `Perl`'s regex has several options and the `s` option (single-line mode) is one of them. In single-line mode, the dot matches "\n" as well. Single-line mode is mainly used to treat text that spans multiple lines, newlines included, as a single string. (Ha! the mode name is confusing.) – tshiono Feb 12 '20 at 06:36
  • @RavinderSingh13 the [link](https://stackoverflow.com/questions/22962588/understanding-perl-regular-expression-modifers-m-and-s) may be also informative. – tshiono Feb 12 '20 at 06:37
  • @tshiono, thank you nice info buddy, cheers. No need to say sorry your post is great thank you :) – RavinderSingh13 Feb 12 '20 at 06:59
  • @tshiono I have updated the question once again! I hope you have time to take a look at it. – twinkette Mar 09 '20 at 22:23
  • @twinkette I've updated my answer as UPDATE3. Hope it will help. As a side note it would have been preferable to post it as a new thread so that more people will pay attention to it. Of course I'd be happy if you can point me to the new thread with a comment like this in such a case. Cheers! – tshiono Mar 10 '20 at 02:07
  • @tshiono Sorry for the delay and thank you for your answer. I used `strings` in order to extract the ASCII characters from the binary file. There is one more aspect which is quite minor so I haven't created a new question (but I will if you ask me to): how can the last 3 characters be made case insensitive. The code is `strings filename(s) | perl -0777 -ne 'while (/(1(.{3}).+)k([LK].*)[^K ]\2/g){print "$1 $3\n"}' > output.txt` (I changed some of the letters a bit). So it wouldn't matter if the first letters were 'AAA' or 'aAa' or 'AAa' - the last three letters will always be 'aaa'. – twinkette Mar 18 '20 at 22:44
  • @tshiono At the moment the code only works if the first three letters are all lower case. And again - thanks so much for your help. I really appreciate your efforts. – twinkette Mar 18 '20 at 22:45
  • @twinkette a simple solution will be to add the `i` switch to the regex for the `case-insensitive match` such as `strings filename(s) | perl -0777 -ne 'while (/(1(.{3}).+)k([LK].*)[^K ]\2/gi){print "$1 $3\n"}' > output.txt`. It also makes the characters k, L, K to be case-insensitive and may overdetect the strings. If you need to keep k, L, and K to be case-sensitive, please let me know. The code will be a bit complicated. – tshiono Mar 19 '20 at 06:00

In the general case, you can use the strings utility to pluck out ASCII from binary files; then of course you can try to grep that output for patterns that you find interesting.

Many traditional Unix utilities like grep have internal special markers which might get messed up by binary input. For example, the character \xFF was used for internal purposes by some versions of GNU grep so you can't grep for that character even if you can figure out a way to represent it in the shell (Bash supports $'\xff' for example).

A traditional approach would be to run hexdump or a similar utility, and then grep that for patterns. However, more modern scripting languages like Perl and Python make it easy to manipulate arbitrary binary data.

perl -ne 'print if m/\xff\xff/' </dev/urandom
tripleee

This might work for you (GNU sed):

sed -En '/\n/!{s/M-\^G/\n/;s/7[^\n]*\n/\n&/};/^7[^\n]*/P;D' file

Split each line into zero or more lines that begin with 7 and end just before M-^G and only print such lines.

potong