
I am working with a SQL Server database table similar to this

USER_ID varchar(50), FILE_NAME ntext, FILE_CONTENT ntext

sample data:

USER_ID:      1
FILE_NAME:    (AttachedFiles:1)=file1.pdf
FILE_CONTENT: (AttachedFiles:1)=H4sIAAAAAAAAAOy8VXQcy7Ku….

Using regular expressions, I have successfully isolated the content of the FILE_CONTENT field by removing the "(AttachedFiles:1)=" prefix, resulting in a string similar to this:

content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc…"

My plan was to reconstruct the file from this string in order to download it from the database. During my investigation, I found this post and replicated its code like this:

import base64
import os

content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(content_str))

...getting a TypeError: expected bytes-like object, not str

Investigating further, I found this other post and proceeded like this:

import base64
import os

content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
encoded = content_str.encode('ascii')
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(encoded))

...which successfully created a PDF file. However, when trying to open it, I get an error saying that the file is corrupt.

I kindly ask for any suggestions on how to proceed. I am even open to rethinking the process I've come up with, if necessary. Many thanks in advance!


1 Answer


The value of FILE_CONTENT is base64-encoded: a string drawn from an alphabet of 64 characters that represents raw bytes. All you need to do is base64-decode the string and write the resulting bytes directly to a file.

import base64
import os

content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc=="

with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(base64.b64decode(content_str))

The base64 sequence "H4sI" at the start of your content string translates to the bytes 0x1f, 0x8b, 0x08. These bytes are not normally at the start of a PDF file, but indicate a gzip-compressed data stream. It's possible that a PDF reader won't understand this.
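You can verify this directly in Python, using the prefix quoted above:

```python
import base64

# Four base64 characters decode to three raw bytes.
prefix = base64.b64decode("H4sI")
print(prefix.hex())                    # 1f8b08: gzip magic bytes + deflate method
print(prefix.startswith(b"\x1f\x8b"))  # True: a gzip stream, not "%PDF"
```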

I don't know for certain whether gzip compression is a valid part of the PDF file format, but it is a valid part of web communication, so perhaps the file stream was compressed for transfer/download and was never decompressed before being written to the database.

If your PDF reader does not accept the data as is, decompress it before saving it to file:

import gzip

# ...

with open(os.path.expanduser('test.pdf'), 'wb') as fp:
    fp.write(gzip.decompress(base64.b64decode(content_str)))
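If `gzip.decompress()` fails with an `EOFError` because the stream was cut off, the `zlib` module can still recover the readable prefix of the stream. This is a diagnostic sketch using simulated data, not a way to get the missing bytes back:

```python
import gzip
import hashlib
import zlib

def decompress_partial(data: bytes) -> bytes:
    """Decompress as much of a (possibly truncated) gzip stream as possible.

    wbits=16+MAX_WBITS tells zlib to expect a gzip header; decompressobj()
    returns whatever it can decode instead of insisting on the
    end-of-stream marker that gzip.decompress() requires.
    """
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    return d.decompress(data)

# Simulate the situation: a gzip stream cut off mid-way, like a BLOB
# truncated by the connection. The payload here is arbitrary demo data.
payload = b"%PDF-" + b"".join(hashlib.sha256(bytes([i])).digest() for i in range(100))
truncated = gzip.compress(payload)[:1600]

recovered = decompress_partial(truncated)
print(recovered.startswith(b"%PDF-"))  # the readable part of the stream survives
```

Whatever `decompress_partial()` returns is only the beginning of the original file; a PDF reader will still reject it, but it lets you confirm what the data actually contains.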
  • Thanks Tomalak! I tried your suggestion but I am now getting an "EOFError: Compressed file ended before the end-of-stream marker was reached". When investigating further, I found some threads suggesting that the error is due to file corruption. Any further suggestions would be much appreciated. – DanielaC Nov 20 '18 at 15:23
  • First, try to write the stream to file without passing it through `gzip.decompress()`. Then try to open the resulting file with your PDF reader, just to check the off-chance that it knows what to do. If it complains, try opening the resulting file in 7zip (which can deal with all kinds of compression formats) to find out if there is anything in it at all. Maybe `gzip.decompress()` is not the right tool yet, it was an educated guess of mine. – Tomalak Nov 20 '18 at 15:29
  • I created a PDF without gzip.decompress() and failed to open it in the reader. I proceeded to change the extension of the PDF to .zip, .rar and .7z, and failed to extract it with 7zip. However, when decompressing it as gzip, the error I get is "Unexpected end of data". Thanks again! – DanielaC Nov 20 '18 at 15:58
  • Can you upload the file you currently have somewhere? I can try and take a look at it, maybe I can figure something out. No promises though. – Tomalak Nov 20 '18 at 16:02
  • Thank you so much Tomalak! On my github now: https://github.com/dcct84/encodedfiles_test/ – DanielaC Nov 20 '18 at 16:52
  • Okay, thanks for the Gist. Unfortunately I have bad news. The database field contents is a gzip-compressed stream, but the file `content_original_from_database.txt` has a suspicious size. It's 32,768 bytes long, which is 32kB *on the mark*. This is not an accident. Only the first 32 kB of the data stream were written to the database. The data is not recoverable, I'm afraid. – Tomalak Nov 20 '18 at 17:55
  • ...unless of course the data in the database is actually longer, but you are using a database connection that cuts off BLOBs at 32 kB, this is also a possibility. In this case you must reconfigure your connection settings and extract the data again. – Tomalak Nov 20 '18 at 18:03
  • Wow! Thank you so much! I will talk to the db admin to see if they can help me reconfiguring the connection. I hope I can come back to this thread with a solution. Again, thank you very much! – DanielaC Nov 20 '18 at 18:17
  • No problem at all. Good luck with the DBA, I hope they can help! – Tomalak Nov 20 '18 at 18:25
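The 32 kB truncation check described in the comments can be sketched as follows (`looks_truncated` is a hypothetical helper name; the 32,768-byte threshold comes from the comment thread):

```python
def looks_truncated(raw: bytes) -> bool:
    """Heuristic: a decoded BLOB of exactly 32 kB suggests it was cut off
    by the database connection rather than ending naturally."""
    return len(raw) == 32 * 1024

# Demo with dummy data of exactly 32 kB:
print(looks_truncated(b"\x00" * 32768))  # prints True
```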