0

I've got a large (117MB!) HTML file that has thousands of images encoded as base64. I'd like to decode them to JPGs, but my bash-fu isn't enough to do this and I haven't been able to find an answer online.

tyler

3 Answers

1

In general, HTML can't be parsed properly with regular expressions, but if you have a specific limited format then it could work.

Given a simple format like

<body>
<img src="">
<img src=""><img src="">
<div><img src=""></div>
</body>

the following can pull out the data:

i=0; awk 'BEGIN{RS="<"} /="data:image\/jpeg;base64,[^\"]*"/ { match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' test.html | while read d; do echo $d | base64 -d > img$i.jpg; i=$(($i+1)); done

To break that down:

i=0 Keep a counter so we can output different filenames for each image.

awk 'BEGIN{RS="<"} Run awk with the Record Separator changed from the default newline to <, so that each HTML tag starts a new record.

/="data:image\/jpeg;base64,[^\"]*"/ Only run the following commands on records that have embedded base64 jpeg data.

{ match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' Pull out the data itself, the part matched with parentheses between the comma and the trailing quotation mark, then print it. The three-argument form of match() is a GNU awk (gawk) extension, so this step needs gawk.

test.html Just the input filename.

| while read d; do Pipe the output base64 data to a loop. read will put each line into d until there's no more input.

echo $d | base64 -d > img$i.jpg; Pass the current image through the base64 decoder and store the output to a file.

i=$(($i+1)); Increment to change the next filename.

done Done.

There are a few things that could probably be done better here:

  • There should be a way to get the line-match regexp to capture the base64 data directly, instead of repeating the regexp in a call to the match() function, but I couldn't get it to work.
  • I don't like the technique of reading a pipe into the variable d, only to echo it back out to another pipe - it would be nicer to just pipe straight through - but base64 doesn't know to only use one line of the input.
  • For some reason I have not yet figured out, incrementing the counter directly where it's used (i.e. echo $d | base64 -d > img$((i++)).jpg) only wrote to the first file, even though echo $d > img$((i++)).b64 correctly wrote the encoded data to multiple files. Rather than waiting on working that out, I've just split the increment into its own command.
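One possible way to tidy up the first and third points, as a sketch rather than a drop-in replacement: grep -o prints only the matching part of each line, so the pattern is written once, and the counter is incremented in its own statement. The function name extract_images and the default img prefix are made up for illustration, and grep -o and base64 -d are assumed to behave as in the GNU tools. A likely explanation for the counter problem is that the redirection target of an external command such as base64 is expanded in the child process that runs it, so the i++ never reaches the loop's own shell, whereas echo is a builtin and its redirection is expanded in the current shell.

```shell
# Hypothetical helper: extract_images FILE [PREFIX]
# Writes PREFIX0.jpg, PREFIX1.jpg, ... into the current directory.
# The body is a subshell, so its variables do not leak into the caller.
extract_images() (
  html=$1
  prefix=${2:-img}
  i=0
  # -o prints each match on its own line, so the regexp appears only once.
  grep -o 'data:image/jpeg;base64,[^"]*' "$html" |
  while IFS= read -r uri; do
    # Strip everything up to the first comma, leaving the base64 payload.
    printf '%s\n' "${uri#*,}" | base64 -d > "$prefix$i.jpg"
    # Increment as a separate statement, in the loop's own shell.
    i=$((i + 1))
  done
)
```

Called as extract_images test.html, this should leave img0.jpg, img1.jpg and so on in the current directory. It still assumes double-quoted src attributes, like the awk version.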
Harun
0

You can try scraping the encoded strings of the images using Python. Then check this out for converting the encoded strings to images.

godsuya
0
  1. Use regex to direct the base64 images to separate files
  2. Write loop to iterate through your files.
  3. Bash command to decode files will be along lines of: cat base64_file1 |base64 -d > file1.jpg
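As a rough sketch of steps 2 and 3, assuming step 1 has already left one base64 payload per file: the .b64 suffix and the function name decode_all are made up for illustration, and base64 -d is assumed to be the decode flag (as in GNU coreutils).

```shell
# Hypothetical helper: decode_all FILE.b64...
# Decodes each FILE.b64 into FILE.jpg alongside it.
# The body is a subshell, so the loop variable does not leak into the caller.
decode_all() (
  for f in "$@"; do
    # ${f%.b64} drops the .b64 suffix before .jpg is appended.
    base64 -d < "$f" > "${f%.b64}.jpg" || exit 1
  done
)
```

Something like decode_all *.b64 would then decode every payload file in the current directory, stopping at the first one that fails to decode.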
Kyle Banerjee
  • thanks, it's not like I've been trying that for the last week – tyler Sep 27 '18 at 18:25
  • Is there any reason you can't just put the image tags one per line, strip out the tag data, and echo the data line by line into the decoder? The regexes for that would be easy. – Kyle Banerjee Sep 27 '18 at 18:30
  • It's 117MB and has more than 1000 images; I can't even open it in a text editor – tyler Sep 28 '18 at 19:28
  • Just use sed to break it up. It's possible to do in one ugly command, but something like this would work: cat htmlfile |sed 's/base64file Basically, all you're doing is bringing all the images to the beginning of the line and stripping everything but the base64 data which is left at one image per line – Kyle Banerjee Sep 28 '18 at 23:22
  • Now that I'm looking at your file more closely, you have repeated images. So you can eliminate the duplicates by piping through sort -u before sending it to the file – Kyle Banerjee Sep 28 '18 at 23:25
  • "Now that I'm looking at your file more closely, you have repeated images" no, you haven't seen the file and no, it doesn't have repeats – tyler Sep 30 '18 at 06:27
  • I mistook Harun's answer for yours since he pasted what looked like a sample and you haven't. Since the info is in a structured tag, how is it not a straightforward process to pull it out with string or XML tools? – Kyle Banerjee Oct 01 '18 at 12:44
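The one-image-per-line idea from the comments might look something like the following. This is a guess at the general approach, not a reconstruction of the truncated sed command above; the function name list_payloads is made up for illustration, and double-quoted src attributes are assumed.

```shell
# Hypothetical helper: list_payloads FILE
# Prints one base64 payload per line.
list_payloads() {
  # Start a new line at every "<" so each tag sits on its own line,
  # then keep only the payload between "base64," and the closing quote.
  tr '<' '\n' < "$1" |
    sed -n 's/.*data:image\/jpeg;base64,\([^"]*\)".*/\1/p'
}
```

Piping the result through sort -u, as suggested in the comments, would drop repeated images, at the cost of losing the original order.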