33

I was attempting to do a sed replacement in a binary file; however, I am beginning to believe that is not possible. Essentially, what I wanted to do was similar to the following:

sed -bi "s/\(\xFF\xD8[[:xdigit:]]\{1,\}\xFF\xD9\)/\1/" file.jpg

The logic I wish to achieve is: scan through the binary file until the bytes FF D8, keep reading until FF D9, and save only that span (discard the junk before and after, but include FF D8 and FF D9 in the saved file).

Is there a good way to do this? Even if not using sed?

EDIT: I just was playing around and found the cleanest way to do it IMO. I am aware that this grep statement will act greedy.

hexdump -ve '1/1 "%.2x"' dirty.jpg | grep -o "ffd8.*ffd9" | xxd -r -p > clean.jpg
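The same greedy carve can be done on the raw bytes without the hex round trip. The following is only a minimal Python sketch under stated assumptions: the function name `carve_jpeg` is hypothetical, and it mirrors the greedy grep (first FFD8 through last FFD9). Working on bytes also sidesteps one pitfall of grepping an unseparated hex string, where "ffd8" can match across a byte boundary (for example in the bytes 0F FD 80).

```python
from typing import Optional

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def carve_jpeg(data: bytes) -> Optional[bytes]:
    """Return the bytes from the first SOI through the last EOI, or None."""
    start = data.find(SOI)
    if start == -1:
        return None
    end = data.rfind(EOI)
    # The EOI marker must begin at or after the end of the SOI marker.
    if end < start + len(SOI):
        return None
    return data[start:end + len(EOI)]


# In-memory demonstration: junk before and after a fake JPEG body.
sample = b"junk" + SOI + b"\x00\x01\x02" + EOI + b"more junk"
carved = carve_jpeg(sample)
```

For real files, the input would be read with `open("dirty.jpg", "rb")` and the result written with `open("clean.jpg", "wb")`. Like the grep, this can still be fooled by stray FFD8 or FFD9 pairs in the surrounding junk.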
Ryan
  • Always beware of false matches when grepping for patterns in what's essentially random data, such as a compressed binary stream! – dwarring Apr 09 '10 at 05:47
  • @snoopy - (1) is there a better solution? (2) if not, what needs to be done to ameliorate this? Stop searching once some "end of metadata" is reached? – DVK Apr 09 '10 at 06:31
  • Depends exactly what you're doing but the CPAN module Image::EXIF lets you extract and change metadata. Might be of use here. – dwarring Apr 09 '10 at 07:08
  • FYI, the purpose of this question was for doing manual file carving in a RAID 5 scenario. When grabbing stripes and chunks you will get data before and after the jpg (or any other file). This was meant to clean it. – Ryan Apr 09 '10 at 15:15

6 Answers

44

bbe is a "sed for binary files", and should work more efficiently for large binary files than hexdumping/reconstructing.

An example of its use:

$ bbe -e 's/original/replaced/' infile > outfile

Further information on the man page.

starfry
Ivan Tarasov
  • When I use it on a block device by redirecting it back *(through `-o` option)* into the same device, it seems it modifies more text than the text I wanted to modify. LVM can't even recognize the device as part of a pool after the edit. – Hi-Angel Jan 14 '22 at 13:28
8

Old question, but,

xxd infile | sed 's/xxxx xxxx/yyyy yyyy/' | xxd -r > outfile

is probably the simplest and most reliable solution. Similar to the edit in the OP.
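One caveat with default `xxd` output: sed sees one 16-byte row at a time, grouped in 2-byte pairs, so a byte sequence that straddles a row or starts at an odd offset will not match a pattern like the one above. As a hedged alternative, here is a small Python sketch that does the substitution directly on the bytes. The helper name `replace_bytes` is hypothetical, and the same-length check is an assumption made here to keep file offsets stable, not something the answer requires.

```python
def replace_bytes(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new`, keeping the length unchanged."""
    if len(old) != len(new):
        # Assumption: equal lengths, so offsets elsewhere in the file stay valid.
        raise ValueError("old and new must have the same length")
    return data.replace(old, new)


# In-memory demonstration; bytes.fromhex ignores the spaces between hex pairs.
sample = bytes.fromhex("00 11 de ad be ef 22")
patched = replace_bytes(sample, bytes.fromhex("dead beef"), bytes.fromhex("cafe f00d"))
```

Because `bytes.replace` searches the whole buffer, it is not affected by row or grouping boundaries.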

Tahlor
recette
3

Is there a good way to do this

Yes, of course: use an image tool that knows how to edit JPEG metadata, such as those from ImageMagick (search the net for "linux jpeg exif editor", etc.). I am sure you can find a tool that suits you. Don't try to do this the hard way. :)

ghostdog74
  • Agreed, this is essentially random binary data, so you've got a 1 / (2 ** 16) chance of a false positive at any given position when searching for a 2-byte sequence. That's about once every 64 KiB (65,536 bytes) of data. – dwarring Apr 09 '10 at 05:24
  • exiftool (http://search.cpan.org/dist/Image-ExifTool/exiftool) is the killer application for media metadata. – daxim Apr 09 '10 at 08:00
  • Just copying my above comment down here: FYI, the purpose of this question was for doing manual file carving in a RAID 5 scenario. When grabbing stripes and chunks you will get data before and after the jpg (or any other file). This was meant to clean it. – Ryan Apr 09 '10 at 15:22
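The false-positive estimate from the comments can be worked through numerically. This is a rough sketch that assumes every byte is uniformly random and independent, which compressed data only approximates; the function name `expected_chance_matches` is hypothetical.

```python
def expected_chance_matches(num_bytes: int, pattern_len: int = 2) -> float:
    """Expected number of chance occurrences of one fixed pattern in random bytes."""
    # Number of offsets at which a pattern of this length could start.
    positions = max(num_bytes - pattern_len + 1, 0)
    # Each offset matches with probability 1 / 256**pattern_len.
    return positions / 256 ** pattern_len


# Expected chance hits for a single 2-byte marker in 1 MiB of random data.
one_mib = expected_chance_matches(1024 * 1024)
```

Under these assumptions a single 2-byte marker is expected to turn up by chance roughly 16 times per MiB, which is why carving on FFD8 and FFD9 alone can pick up phantom boundaries.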
2

sed might be able to do it, but it could be tricky. Here's a Python script that does the same thing (note that it edits the file in-place, which is what I assume you want to do based on your sed script):

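import re

# Open in binary read/write mode so the file can be rewritten in place.
with open('file.jpeg', 'rb+') as f:
    data = f.read()
    # Bytes pattern: FF D8, then any bytes (non-greedy), then the first FF D9.
    # re.DOTALL lets '.' match newline bytes as well.
    match = re.search(rb'\xff\xd8.*?\xff\xd9', data, re.DOTALL)
    if match:
        # group(0) includes both markers, as the question asks.
        f.seek(0)
        f.write(match.group(0))
        f.truncate()
    else:
        print('No match')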
Adam Rosenfield
  • Could you kindly explain how the pattern `(\xff\xd8[0-9A-fa-f]+)\xff\xd9` would change if I want to replace C:\path/sub with /path/sub? Thank you in advance for any answer. – 16851556 Oct 25 '20 at 17:08
  • Hey @16851556, your question intrigues me. Challenge accepted. I believe it becomes `\x43\x3a\x5c([xX]?[0-9a-fA-F]*)`. But `re.search(..., data)` will not do substitution. ;) – David Golembiowski Aug 21 '21 at 04:25
  • Rather than `\x43\x3a\x5c([xX]?[0-9a-fA-F]*)` I think it should be `\x43\x3a(\x5c([xX]?[0-9a-fA-F]*))+`. Everyone's circumstances are different, but if your employer is asking you to do this, they're completely nuts, and unless you're making crazy money, you must flee from the workplace. – David Golembiowski Aug 21 '21 at 04:37
1

Also, this Perl might work (not tested, caveat emptor)... if Python is not installed :)

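# Slurp the whole file into $content in 8 KiB chunks.
open(FILE, "file.jpg") || die "no open $!\n";
while (read(FILE, $buff, 8 * 2**10)) {
    $content .= $buff;
}
# /s lets '.' match newline bytes; +? stops at the first FF D9 after each FF D8.
@matches = ($content =~ /(\xFF\xD8.+?\xFF\xD9)/gs);
print STDOUT join("", @matches);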

You need to add binmode(FILE); binmode(STDOUT); on DOS or VMS after the open() call - not needed on Unix.
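For comparison, the "collect every match" idea can be sketched in Python so that the offsets of each candidate span are visible before anything is written out. This is an illustrative sketch only: `find_candidates` and `JPEG_SPAN` are hypothetical names, and the non-greedy pattern is an assumption that stops at the first FF D9 after each FF D8.

```python
import re
from typing import List, Tuple

# Non-greedy bytes pattern; re.DOTALL lets '.' match newline bytes too.
JPEG_SPAN = re.compile(rb"\xff\xd8.+?\xff\xd9", re.DOTALL)


def find_candidates(data: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of each non-overlapping FF D8 ... FF D9 span."""
    return [m.span() for m in JPEG_SPAN.finditer(data)]


# In-memory demonstration with two fake spans, one containing a newline byte.
sample = b"xx\xff\xd8AB\xff\xd9yy\xff\xd8\nC\xff\xd9"
spans = find_candidates(sample)
```

Each pair can be used as `data[start:end]` to write a candidate to its own file, which makes it easier to spot a span that was cut short or that matched by chance.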

DVK
  • Sorry DVK - that was me. I've been bitten by bugs myself when trying to grep for short patterns in binary data. I think there's a good chance of this mismatching, either on one or other of the anchors or by picking up a completely random 'phantom pattern'. Sooner or later the OP is likely to end up with the odd scrambled jpeg and wonder why! Also downvoted others for the same reason. – dwarring Apr 09 '10 at 06:21
  • 1
    If you're saying that OP has an XY problem, please present a better solution than a regex before downvoting regex solutions as "bad". If this answer has a bug, please point it out. If there's a specific pattern where the regexp approach would fail, please clarify that as an answer (again, XY). – DVK Apr 09 '10 at 06:29
  • 1
    Also, please note that this solution does NOT change the jpg file. Merely outputs found strings (which I'm guessing might be metadata) to standard out for later redirect/consumption – DVK Apr 09 '10 at 06:31
0
sed -i "s/$(python -c "print('\x1f', end='')")/;/g" file
h5v