1

I'm trying to use a regex to format some binary from xxd -b, but to demonstrate this simply I'll show you what I expect to happen:

Regex to delete: /1x|1.*/

Text: 1x21y3333333313333 -> 2

All occurrences of 1x should be deleted, and then everything from the first remaining 1 onward should be deleted. It should be immediately obvious what's going on, but if it's not, play with this. The key is that if 1x matches, the rest of the pattern should not be tried.

Here is the output from echo "AA" | xxd -b (the bindump of AA\n):

0000000: 01000001 01000001 00001010                             AA.

My goal is to 1. delete the first 0 of every byte (ASCII = 7 bits) and 2. delete the rest of the line so only the actual binary is kept. So I piped it into sed 's/ 0//g':

0000000:100000110000010001010                             AA.

Adding the second step, sed -E 's/ 0| .*//g':

0000000:

Obviously, I expect to instead get:

0000000:100000110000010001010

Things I've tried that haven't done the job:

  • xxd can take -g0 to merge the columns, but it retains the first zero in every byte (characters each take up a byte, not 7 bits)
  • -r

I will use perl instead in the meantime, but this behaviour baffles me and maybe there's a reason (lesson) here?

Unihedron

3 Answers

2

If I understand your question correctly, this produces what you want:

$ echo "AA" | xxd -b | sed -E 's/ 0|  .*//g'
00000000:100000110000010001010

The key change here is the use of two blanks in front of .* so that this only matches the part that you want to remove.

Alternatively, we can remove blank-zero first:

$ echo "AA" | xxd -b | sed -E 's/ 0//g; s/ .*//'
00000000:100000110000010001010
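
To try the two-step version without depending on a particular xxd build, here is a minimal sketch that feeds a captured line through the same sed script. The function name strip_bits is hypothetical, and the sample line is simply the xxd -b output for "AA\n" quoted above; it assumes every byte has a leading 0 bit, as in 7-bit ASCII.

```shell
# Hypothetical helper: keep only the offset and the 7-bit groups of one `xxd -b` line.
strip_bits() {
  # Step 1: delete each " 0" (the separator plus the leading zero bit of a byte).
  # Step 2: delete everything from the first remaining space to the end of the line.
  sed -E 's/ 0//g; s/ .*//'
}

# Captured `xxd -b` line for "AA\n" (the offset width varies between xxd versions).
line='00000000: 01000001 01000001 00001010                             AA.'

result=$(printf '%s\n' "$line" | strip_bits)
printf '%s\n' "$result"
```

Because the two substitutions run one after the other, the second one never competes with the first for the same starting position.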
John1024
  • I like the second snippet, it made me slap myself. Instead of doing the alternations in a regex, I could just split it! – Unihedron Mar 30 '19 at 22:12
1

Try the following:

 s/ 0| [^0].*//g

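The reason for the behavior you see is that POSIX requires regex engines to follow the leftmost-longest match rule. When several alternatives can match at the same starting position, the longest match wins, regardless of the order in which the alternatives are written. So the second side of the alternation, being longer, is chosen over the first.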
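
The toy example from the question makes this easy to see. The following is a small sketch, assuming a POSIX-conforming sed with -E support (GNU sed, for instance); the variable names are only for illustration.

```shell
# Toy input from the question.
input='1x21y3333333313333'

# Single alternation: at position 0 both "1x" and "1.*" can match.
# A POSIX engine picks the longer one, so the whole line is deleted.
one_pass=$(printf '%s\n' "$input" | sed -E 's/1x|1.*//g')

# Two separate substitutions: delete every "1x" first,
# then cut from the first remaining "1" to the end of the line.
two_pass=$(printf '%s\n' "$input" | sed -E 's/1x//g; s/1.*//')

printf 'one pass: [%s]\n' "$one_pass"   # prints: one pass: []
printf 'two pass: [%s]\n' "$two_pass"   # prints: two pass: [2]
```

Engines that try alternatives left to right, such as Perl's, would take 1x at position 0 instead, which is the behavior the question expected.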

revo
  • Oh my god, you're right. After looking up "posix longest match" and jumping through two links, apparently it says so [here](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html): "If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. [...] the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights"." Definitely counter-intuitive after having used a lot of modern regexes... – Unihedron Mar 30 '19 at 22:10
  • Although this answer fits the given input string, it will fail if a sequence starts e.g. with `1`. The right way is the second approach of John1024's. – revo Mar 30 '19 at 22:25
  • @revo My update to https://stackoverflow.com/a/216228/6309 was not complete enough? – VonC Mar 31 '19 at 13:19
  • @VonC It was. I wanted to draw more attention to involve more people in. Unfortunately, I forgot to give the bounty at the end of period. – revo Mar 31 '19 at 14:17
0

Tried on GNU sed:

sed -E 's/\s+(0|[a-z.]+)//ig'