Python Decoding binary data back to file

Question

I have a database in MSSQL with compressed and converted files. The values look like this:

[screenshot of the values; each of them is 40k symbols long]

I need to decode these files into PDF, DOCX and PNG files.

I've tried to do this via base64, but it didn't build correct files.

Do you have any ideas how I could decode all of them and build correct files?

*i've tried to do this via base64* - the data shown in the picture is hex data. To test it, you can feed these values into an online converter (e.g. https://onlinehextools.com/convert-hex-to-binary) and see what comes out. – jps, Jan 19 '22 at 12:02
Here is the link to the test file, which is a PNG image: [link](https://dropmefiles.com.ua/ru/YLWGMf5aVU) – Arseniy, Jan 19 '22 at 12:03
@jps But I need to do this a few thousand times, so the solution should be in Python. – Arseniy, Jan 19 '22 at 12:05
It was only meant for testing. To write a Python program, look into https://stackoverflow.com/questions/26796770/writing-hex-data-into-a-file – jps, Jan 19 '22 at 12:14
Maybe you could generate a fresh database, put some known PNG files of varying sizes into it, see how many blobs you get for each, and share the actual resulting blobs? – Mark Setchell, Jan 24 '22 at 15:54

K J · Accepted Answer · 2022-01-30T17:07:11.253

As you're learning the hard way, stuffing blobs into a text database is probably the worst sin a novice data manager can commit: the result is bloated, unwieldy and slow. It is best to leave the source files in their fast, natively compressed state and simply reference them in the DB by a related unique ID and file storage name. Rant over.

The fact that they are fixed-size blocks of 40K suggests the files were chunked into pieces, so several chunks are needed to rebuild one whole BLOB.
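If the chunks really are consecutive pieces of one file, rebuilding it could look roughly like the sketch below. It assumes every row is plain hex text (optionally starting with `0x`), that the rows are already sorted in the right order, and that the pieces join without any per-chunk header. None of that is confirmed for this database, and `join_hex_chunks` and the sample rows are made up for illustration.

```python
def join_hex_chunks(chunks):
    """Decode hex-text chunks (optionally prefixed with '0x') and join them in order."""
    parts = []
    for chunk in chunks:
        text = chunk.strip()
        # Assumption: a leading "0x" only marks the value as hex and is not data
        if text[:2].lower() == "0x":
            text = text[2:]
        parts.append(bytes.fromhex(text))
    return b"".join(parts)


# Hypothetical rows: one short payload split across three chunks
rows = ["0x48656C6C6F", "0x2C20", "0x776F726C64"]
blob = join_hex_chunks(rows)
print(blob)
```

In a real table the ordering column and any per-chunk header would have to be worked out first, otherwise the joined bytes will not form a valid file.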

The blob you presented appears to be just part of a PNG image that, if I am interpreting it correctly, should be:

2164 pixels wide by 835 pixels high = 22.54 x 8.70 inches

However, the output is only 4 pixels high within that oddly suspect canvas size, which might be correct if it is just the first part of a much longer truncated stream. The colour range from such a narrow band does not help determine the subject matter, but there appears to be a distinct near-white margin down the right-hand side and none on the top or left edge.

Your 40K chunk translates to about 20K of binary with the characteristics of a PNG, but a PNG starts with 89, so you have a problem: yours is prefixed with 0x 00 22 40 DD BF (decimal = 574676415, thus too big for the expanded PNG memory requirement, which is estimated to be 5,420,860 bytes).

We can discard the 0x as the marker of a hex stream and use the remainder, as I did above, but what is the significance of the odd 00 22 40 DD BF? Most likely it contains, in part, an indicator of the type, the final full length, and/or a pointer to the next chunk.
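To inspect that prefix in Python, one option is to decode the hex and search for the standard 8-byte PNG signature. Everything before it is the unexplained header. This is only a sketch: `split_prefix` is a hypothetical helper, and the sample string is synthetic (the prefix quoted above followed by the PNG signature and the start of an IHDR chunk), not the OP's real data.

```python
PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")


def split_prefix(hex_text):
    """Return (prefix, body): body starts at the PNG signature, prefix is what precedes it."""
    text = hex_text.strip()
    if text[:2].lower() == "0x":
        text = text[2:]
    data = bytes.fromhex(text)
    pos = data.find(PNG_SIGNATURE)
    if pos < 0:
        raise ValueError("no PNG signature found")
    return data[:pos], data[pos:]


# Synthetic sample: quoted prefix + PNG signature + first bytes of an IHDR chunk
sample = "0x002240DDBF" + "89504E470D0A1A0A" + "0000000D49484452"
prefix, body = split_prefix(sample)
print(prefix.hex(" ").upper())          # the bytes in front of the PNG
print(int.from_bytes(prefix, "big"))    # the same bytes read as one big-endian integer
```

Running this over several rows and comparing the prefixes may show whether they encode a length, a type, or a sequence number.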

What you need to do is extract that image by your normal method and compare the total expected file size, since translated into 20 KB of binary this chunk can only equate to a small 0.5 percent of the total to be expected. In that case you need to determine how and where the rest of the image is stored in order to concatenate all the (200?) parts into one homogeneous blob, i.e. a single image.

You need to have sight of the method by which chunks are extracted slowly, converted slowly and stitched together slowly, using some measure of expected file size. What we know is that your entry has 5 bytes before the data body, whereas the norm for a LONGBLOB is 4 and for a MEDIUMBLOB 3 (see https://www.educba.com/mysql-blob/), so we have no idea why it is not normal other than that it was done that way by a programmer.

A fairly similar problem, where I suggested that knowledge of the DB structure was needed, is "How to retrieve original pdf stored as MySQL mediumblob?". The answer there was to interrogate the developer, who had stored the data in an even odder way than yours.

Thank you a lot for this explanation. But this structure was not made by me; it was made by the person who developed our company's database client. I wanted to extract data from these tables, but it seems that is not possible until you know how the data was split into small parts. – Arseniy, Jan 20 '22 at 06:21

score 1 · Answer 2 · answered Jan 24 '22 at 11:26

Your data appears to be a PNG with something prepended to the front of it. If you strip the first 12 bytes with dd and then convert the hex back to binary with xxd, you can recover the start of a PNG file:

dd bs=12 skip=1 if=YOURFILE | xxd -r -p > image.png
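Since the question asks for Python, the same two steps (skip the first 12 characters of the text dump, then turn the remaining hex into bytes) might be sketched like this. `hex_text_to_file` is a made-up helper name, and the tiny dump written to a temporary directory is synthetic test data, not the OP's file.

```python
import tempfile
from pathlib import Path


def hex_text_to_file(src, dst, skip_chars=12):
    """Skip the leading characters of a hex text dump and write the rest as binary."""
    text = Path(src).read_text(encoding="ascii")
    # bytes.fromhex ignores ASCII whitespace, so a trailing newline is harmless
    data = bytes.fromhex(text[skip_chars:])
    Path(dst).write_bytes(data)
    return len(data)


# Synthetic dump: 12 prefix characters followed by the 8-byte PNG signature
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "dump.txt"
    dst = Path(tmp) / "image.png"
    src.write_text("0x002240DDBF" + "89504E470D0A1A0A\n", encoding="ascii")
    written = hex_text_to_file(src, dst)
    recovered = dst.read_bytes()

print(written, recovered)
```

Note that `bytes.fromhex` raises `ValueError` on an odd number of hex digits, so a truncated dump is reported rather than silently converted.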

You can then check that PNG file and see its size and the fact that it is truncated like this:

pngcheck -v image.png

Sample Output

File: image.png (21833 bytes)
  chunk IHDR at offset 0x0000c, length 13
    2164 x 835 image, 24-bit RGB, non-interlaced
  chunk sRGB at offset 0x00025, length 1
    rendering intent = perceptual
  chunk gAMA at offset 0x00032, length 4: 0.45455
  chunk pHYs at offset 0x00042, length 9: 3779x3779 pixels/meter (96 dpi)
  chunk IDAT at offset 0x00057, length 65445:  EOF while reading data
ERRORS DETECTED in image.png
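If pngcheck is not available, a rough equivalent of that truncation check can be written in Python by walking the chunk headers (4-byte big-endian length, 4-byte type, data, 4-byte CRC). The sketch below does not verify CRCs. `list_png_chunks` and `make_chunk` are hypothetical helpers, and the test image is built in memory with the same dimensions as above and then cut short on purpose.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def list_png_chunks(data):
    """Return (type, offset, length, complete) per chunk, stopping at the first truncated one."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    chunks = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4  # length and type fields + data + CRC
        complete = end <= len(data)
        # Offset of the type field, which is what the pngcheck output above appears to report
        chunks.append((ctype.decode("ascii", "replace"), pos + 4, length, complete))
        if not complete:
            break
        pos = end
    return chunks


def make_chunk(ctype, payload):
    """Build one PNG chunk (used only to create test data)."""
    crc = zlib.crc32(ctype + payload)
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


# Test data: a 2164 x 835 RGB header plus an IDAT chunk, truncated after 60 bytes
ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 2164, 835, 8, 2, 0, 0, 0))
idat = make_chunk(b"IDAT", bytes(100))
truncated = (PNG_SIGNATURE + ihdr + idat)[:60]

for name, offset, length, complete in list_png_chunks(truncated):
    print(name, hex(offset), length, "ok" if complete else "EOF while reading data")
```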

answered Jan 24 '22 at 11:26

Mark Setchell


@KJ The 21833 is the number of bytes that `xxd` recovered from the text file supplied by the OP. – Mark Setchell Jan 24 '22 at 15:31
@KJ For my part, I don't understand what the significance of MSSQL is here - I perceived the file supplied by the OP to be something they obtained by dumping a BLOB so I don't see why it matters whether the blob came from MSSQL, sqlite3 or a disk-based file. But hey, you've already provided an accepted answer, already upvoted by me so I guess everyone is happy :-) – Mark Setchell Jan 24 '22 at 15:36
@KJ You cannot tell how big the file should be because different content will compress to a different size - repetitive blocks of solid colour will compress to almost nothing, photos will hardly compress at all. – Mark Setchell Jan 24 '22 at 15:37

score 0 · Answer 3 · answered Jan 19 '22 at 12:20

The data is hex-encoded. Try:

from base64 import b16decode

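# Hex string as it appears in the database, including the "0x" prefix
encoded = '0x48656C6C6F'
# Drop the "0x" prefix; b16decode expects uppercase hex digits by default
decoded = b16decode(encoded[2:])
print(decoded)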

Outputs b'Hello'
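Once the bytes are decoded, the target extension (PDF, DOCX or PNG in the question) can often be guessed from the first few bytes. The sketch below uses the well-known signatures `%PDF` for PDF, the 8-byte PNG signature, and `PK\x03\x04` for ZIP containers. Treating every ZIP container as DOCX is an assumption that only holds if no other ZIP-based formats are stored, and `guess_extension` is a hypothetical helper name.

```python
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"%PDF", ".pdf"),
    (b"PK\x03\x04", ".docx"),  # ZIP container; assumed to be DOCX here
]


def guess_extension(data):
    """Pick a file extension from the leading bytes, falling back to '.bin'."""
    for magic, ext in MAGIC:
        if data.startswith(magic):
            return ext
    return ".bin"


# Made-up example value: the hex for "%PDF-1.4" with a "0x" prefix
decoded = bytes.fromhex("0x255044462D312E34"[2:])
print(guess_extension(decoded))
```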

answered Jan 19 '22 at 12:20

vaizki


It returns an error: Odd-length string – Arseniy Jan 19 '22 at 12:24
Sounds like your data is bad. Hex encoding is basically 4 bits per character, so representing bytes (8 bits) always requires 2 characters in the range 0-F. If the string length is odd, then most probably the data was truncated when it was fed into the DB? – vaizki Jan 19 '22 at 12:27
As far as I know, it has been compressed or something like that. Is there any way it could work? – Arseniy Jan 19 '22 at 12:39
It would be highly unusual to see any data that wasn't in whole bytes, compressed or not. You can try dropping the last character to make the string even length and seeing if the results look like the files you are expecting. But I still suspect your data is truncated. – vaizki Jan 19 '22 at 13:02
I am 100% sure that the data is complete, because it is possible to open and save this file in a desktop application connected to that database. But it is too slow to save each file one by one, so I need a Python script that can decode and download them. – Arseniy Jan 19 '22 at 13:38
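The suggestion above of dropping the last character when the length is odd could be tried with something like the following. It is a diagnostic sketch only: `decode_hex_lenient` is a made-up name, the sample value is invented, and an odd length may well mean the value was cut off when it was copied or exported, in which case the result is still incomplete.

```python
def decode_hex_lenient(value):
    """Decode hex text, dropping a dangling final digit; returns (bytes, was_digit_dropped)."""
    text = value.strip()
    if text[:2].lower() == "0x":
        text = text[2:]
    dropped = len(text) % 2 == 1
    if dropped:
        # A single leftover hex digit cannot form a byte, so discard it
        text = text[:-1]
    return bytes.fromhex(text), dropped


# Invented value with one stray digit at the end
data, dropped = decode_hex_lenient("0x48656C6C6F2")
print(data, dropped)
```

Logging which rows needed the fix makes it easier to check afterwards whether those values were truncated at the source.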
