2

I make a requests.post() call to the server, which replies with JSON. This JSON contains some keys and also the base64-encoded file.

This is an example of a response from the server. The keys are:

  • 'success' tells me whether access with my private data was correct.
  • 'message' is filled in when success is False (in this case success == True, so the message is empty).
  • 'data' is the dictionary that contains the fileName and the base64-encoded file.

So the response looks like this:

{'success': True,
 'message': '',
 'data': {'fileName': 'Python_logo_and_wordmark.svg.png',
          'file': 'iVBORw0KGgoAAAANSUhEUgAABLAAAA....'}} # To save space, I cut the very long base64 string

So the JSON response also contains the file, which I need to decode with base64.b64decode(r.json()['data']['file']).

Everything is OK; I can get my file and decode it correctly.
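
For reference, this is roughly what the working non-streaming version looks like (my_url and my_data stand for my real endpoint and payload, and I just use the fileName the server sends):

import base64
import requests

r = requests.post(url=my_url, json=my_data)
response = r.json()
if response['success']:
    # write the decoded file using the name provided by the server
    with open(response['data']['fileName'], 'wb') as f:
        f.write(base64.b64decode(response['data']['file']))
else:
    print(response['message'])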

The problem is that with large files I would like to use the streaming method, like this:

file = "G:\Python_logo_and_wordmark.svg.png"
if os.path.isfile(file):
    os.remove(file)

def get_chunk(chunk):

    # Try to decode the base64 file (Chunked)
    # is this a wrong approach?
    chunk = chunk.decode("ascii")
    chunk = chunk.replace('"', '')
    if "file" in chunk:
        chunk = chunk.split('file:')[1]
    elif "}}" in chunk:
        chunk = chunk.split('}}')[0]
    else:
        chunk = chunk
    
    chunk += "=" * ((4 - len(chunk) % 4) % 4)
    chunk_decoded = base64.b64decode(chunk)
    return chunk_decoded

r = requests.post(url=my_url, json=my_data, stream=True)

iter_content = r.iter_content(chunk_size=64)
    
while True:
    chunk = next(iter_content, None)
    if not chunk:
        break
    chunk_decoded = get_chunk(chunk)

    with open(file, "ab") as file_object:
        file_object.write(chunk_decoded)

iter_content returns chunks like this:

b'{"success":true,"message":"","data":{"fileName":"Python_logo_and'
b'_wordmark.svg.png","file":"iVBORw0KGgoAAAANSUhEUgAABLAAAAFkCAYAA'
b'AAwtsJRAAAABGdBTUEAALGPC\\/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAA'
b'dTAAAOpgAAA6mAAAF3CculE8AAAABmJLR0QA\\/wD\\/AP+gvaeTAACAAElEQVR42u'
b'zdeZwbdf0\\/8Nf7k2Ovdttyt7QIggoth1qUW1AQ5PLeAiK13UwWiqLiBZ4Eb+T6+'

Sometimes there are padding errors in the decoding, but after a week of trying I preferred to ask this question here, as I'm afraid my whole approach to this situation is wrong. I would like to know how to handle this situation in the right way.

NoobCat
  • 91
  • 1
  • 9
  • Does this answer your question? [Download large file in python with requests](https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests) – rzlvmp Aug 11 '21 at 12:27
  • Hello @rzlvmp it looks similar, but actually my problem is that the encoded file is contained in the JSON response. I don't have to write the JSON, but the file that is contained in the response, i.e. in {"data": {"file": "b64string"}} – NoobCat Aug 11 '21 at 12:40
  • what error are you exactly getting? can you explain a bit more with log? – devReddit Aug 11 '21 at 13:20
  • Hello @devReddit, my problem here is that I'm afraid of taking the wrong approach to the situation, so I can't find an answer anywhere. I got all kinds of errors; I also tried to pad the base64, but in the end the image I get is damaged. So I posted my example approach to understand whether this kind of approach can actually work, or is very wrong. I find many answers about downloading files in chunks with stream=True, but no example for my case. – NoobCat Aug 11 '21 at 13:25
  • @NoobCat check my answer – devReddit Aug 11 '21 at 14:36

4 Answers

1

According to your requirement mentioned in the comment, I'm pointing out the current issues and probable future problems below:

In your get_chunk function, you're doing this:

chunk = chunk.decode("ascii")
chunk = chunk.replace('"', '')
if "file" in chunk:
    chunk = chunk.split('file:')[1]
elif "}}" in chunk:
    chunk = chunk.split('}}')[0]
else:
    chunk = chunk

Now look at the first chunk given by iter_content:

b'{"success":true,"message":"","data":{"fileName":"Python_logo_and'
  1. It will fall under the condition if "file" in chunk:, because the string file appears inside fileName. So when it tries to split on file:, it will return a list of one element, because file was part of fileName, not file:. Hence the program will throw the following error:
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    chunk = chunk.split('file:')[1]
IndexError: list index out of range

try if "file:" in chunk: instead.

  2. Your program may also fail if the fileName contains something like "prod_file:someName". You have to check for that too.

  3. A chunk that doesn't contain file can still contain }}, so that can break what you're trying to achieve too.

You can modify the server response and wrap the start and end of the base64-encoded file string with unique identifiers, so that you receive a response like the one below and can reliably identify the start and end of the file in this streaming approach. For example:

{'success': True,
 'message': '',
 'data': {'fileName': 'Python_logo_and_wordmark.svg.png',
          'file': '0000101100iVBORw0KGgoAAAANSUhEUgAABLAAAA....0000101101'}}

I've appended 0000101100 as the starting identifier and 0000101101 as the ending one. You can trim them off while writing the chunks to the file. You can use any other unique identifier format of your own that doesn't conflict with the base64 encoding.
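
On the client side, a rough sketch of how the stream could then be consumed, assuming the server wraps the base64 string with the markers above (my_url, my_data and the chunk size are taken from your question; output.png is just a placeholder):

import base64
import requests

START = '0000101100'  # marker the server prepends to the base64 string
END = '0000101101'    # marker the server appends to it

r = requests.post(url=my_url, json=my_data, stream=True)

buffer = ''
started = False
finished = False
with open('output.png', 'wb') as out:
    for chunk in r.iter_content(chunk_size=64):
        # drop JSON escape backslashes (e.g. "\/" inside the base64 string)
        buffer += chunk.decode('ascii').replace('\\', '')
        if not started:
            idx = buffer.find(START)
            if idx == -1:
                continue
            buffer = buffer[idx + len(START):]
            started = True
        end_idx = buffer.find(END)
        if end_idx != -1:
            buffer = buffer[:end_idx]
            finished = True
        # hold back enough characters so a marker split across chunks is never
        # decoded as data, and flush only whole 4-character base64 groups
        keep = 0 if finished else len(END)
        flush_len = max(0, len(buffer) - keep)
        flush_len -= flush_len % 4
        out.write(base64.b64decode(buffer[:flush_len]))
        buffer = buffer[flush_len:]
        if finished:
            break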

Feel free to ask if there's any further confusion.

devReddit
  • 2,696
  • 1
  • 5
  • 20
  • Sounds very interesting, although to be honest my fear is that this adds one more process, let me explain. If I have not misunderstood: we are transforming the response into JSON, and therefore from the JSON I will have to pull out the file (correct me if I'm wrong). I'm afraid of 2 things: 1) Many files to download, even thousands upon thousands (so the process may take a long time); 2) Some files are very large, so they would take up a lot of memory when transforming them into real files. Forgive my doubts, but I am doing a very important job with perhaps too much responsibility on it. – NoobCat Aug 11 '21 at 14:56
  • I was thinking about writing the file during the iter process, but this seems very complicated. – NoobCat Aug 11 '21 at 14:57
  • @NoobCat how can you distinguish between a chunk from file and the case where the chunk has various parts of the JSON other than file, let's say a long message, or a message which contains the word `file` too? – devReddit Aug 11 '21 at 15:02
  • I could identify the part that is not file better (your observation on "file" is very useful, as it gives an error). I thought of rebuilding the JSON from, let's say, the first chunk of the response; obviously in the example I set stream=True with chunk_size=64, but I'm thinking of using a higher number, something like 512 * 1024, and therefore I would not have big problems getting the JSON part of the response, {"success": true, "message": ""}, and then grabbing only the "file". I thought I would then write the file part directly inside the while True loop. – NoobCat Aug 11 '21 at 15:09
  • @NoobCat I understand your point. I've updated my answer according to your requirement. I tried to point out the risks and proposed a way to avoid the hassles while keeping the implementation as you are expecting. It's now your call whether the answer was helpful or not. Let me know if you have any further query – devReddit Aug 11 '21 at 15:45
  • Your advice is very useful, really. I think I will keep the question here for a few days; in the meantime I will do some tests on this script. Besides, it is really complicated to manage the base64 chunks; I must get clearer ideas on this. – NoobCat Aug 11 '21 at 15:50
  • best of luck mate! – devReddit Aug 11 '21 at 15:51
1

I tried to analyze your problem, and I can't find a better solution than the one @devReddit provided.

The reason is that it is impossible (or very difficult) to parse the data before it is completely downloaded.

A workaround may be to save the data as-is in one big file and parse it with a separate worker. That will decrease memory usage while downloading the file and avoid losing data.

  1. Save the file as-is:
...
while True:
    chunk = next(iter_content, None)
    if not chunk:
        break
    with open(file, "ab") as file_object:
        file_object.write(chunk)
...
  2. Read the file in a separate worker:
import json
import base64

with open("saved_as_is.json") as json_file:
    json_object = json.load(json_file)

encoded_base64 = json_object['data']['file']
decoded = base64.b64decode(encoded_base64)
...

Why is parsing the data on the fly so difficult?

  1. The "file" key may be split across two chunks:
b'... ... ... .., "fi'
b'le": "AAAB... ... .'
  2. Actually \ is an escape character and you must handle it manually (and don't forget that \ itself may be split across chunks → b'...\', b'\...'):
b'dTAAAOpgAAA6mAAAF3CculE8AAAABmJLR0QA\\/wD\\/AP+gvaeTAACAAElEQVR42u'
  3. If the file is super tiny, the chunk may look like this:
b'"file":"SUPERTINY_BASE64_DECODED", "fileName":"Python_lo'

And chunk.split('file:')[1] won't work.

  4. A base64 chunk must have a length that is a multiple of 4, so if your first chunk (the characters after "file":) is, say, 3 characters long, you will need to read the next chunk and move characters from it to the end of the previous chunk, and this carries over for all following iterations (see the small sketch after this list).
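
A small sketch of that carry-over, just to illustrate (the helper function and the sample chunks are made up for the example):

import base64

def decode_with_carry(chunks):
    carry = ''
    for chunk in chunks:
        data = carry + chunk
        cut = len(data) - len(data) % 4   # decode only whole 4-character groups
        yield base64.b64decode(data[:cut])
        carry = data[cut:]                # the remainder is prepended to the next chunk

# 'hello world' encoded and split into chunks whose lengths are not multiples of 4
parts = decode_with_carry(['aGVsbG', '8gd29y', 'bGQ='])
print(b''.join(parts))  # b'hello world'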

So there are tons of nuances if you try to parse the data manually.

However, if you want to choose this hard way, here is how to decode base64 in chunks.

And here is the list of allowed base64 characters.

If you want to use @devReddit's solution and store the whole data in memory, I'm not sure there is any benefit of using stream at all.

rzlvmp
  • 7,512
  • 5
  • 16
  • 45
  • Hello, I think you are also suggesting, like @devReddit, to write a JSON file, and it almost seems like the best solution, but I'm still not 100% sure. I have to test this route well, as it would be a matter of creating a sort of ".tmp" file in JSON format, instead of creating a ready-made "file.tmp" directly. The problem, however, may reveal itself when you download a large file (suppose 10 GB or more): I would have to extract the file from the JSON, load it into memory and then decode it from base64 all in one go. I think this can really be a danger to RAM on low-memory computers. – NoobCat Aug 11 '21 at 17:06
1

Okay, here is a complete working solution:

Server side (main.py):

I added this code to be able to run a test server that responds with JSON data containing a base64-encoded file.
I also added some randomness to the response, to check that the string parsing does not depend on character positions.

import base64 as b
import json as j
from fastapi import FastAPI as f
import requests as r
import random as rr
import string as s
import uvicorn as u

banana_url = 'https://upload.wikimedia.org/wikipedia/commons/c/ce/PNG_demo_Banana.png'
banana_b64 = b.encodebytes(
    r.get(banana_url, stream=True).raw.read())
banana_b64 = banana_b64.decode('ascii').replace('\n', '').encode('ascii')

def get_response(banana_file, banana_file_name):
    random_status = ''
    for i in range(rr.randint(3, 30)): random_status += rr.choice(s.ascii_letters)

    banana_response = {
        'status': random_status,
        'data': {
            'fileName': banana_file_name.split('/')[-1],
            'file': banana_file,
        }
    }

    if len(random_status) % 2 == 0:
        banana_response['data']['random_payload'] = 'hello_world'
        banana_response['random_payload'] = '%hello_world_again%'

    return banana_response

app = f()

@app.get("/")
async def read_root():
    resp = get_response(banana_b64, banana_url.split('/')[-1])
    print('file length:', len(resp['data']['file']))
    return resp

if __name__ == "__main__":
    u.run('main:app', host="0.0.0.0", port=8000, reload=True, workers=1)

Client side (file downloader decoder.py):

import requests
import base64

# must be larger than len('"file":')
CHUNK_SIZE = 64

# iterable response
r = requests.get('http://127.0.0.1:8000', stream=True).iter_content(chunk_size=CHUNK_SIZE)

class ChunkParser:

    file = None
    total_length = 0

    def close(self):
        if self.file:
            self.file.close()

    def __init__(self, file_name) -> None:
        self.file = open(file_name, 'ab')

    def add_chunk(self, chunk):

        # remove all escape symbols if existing
        chunk = chunk.decode('ascii').replace('\\', '').encode('ascii')

        # if the chunk size is not a multiple of 4, return the remainder so it can be added to the next chunk
        modulo = b''
        if not (l := len(chunk)) % 4 == 0:
            modulo = chunk[l-(l%4):]
            chunk = chunk[:l-(l%4)]

        self.file.write(base64.b64decode(chunk))
        self.total_length += len(chunk)

        return modulo



prev_chunk = None
cur_chunk = None
writing_started = False
last_chunk = False
parser = ChunkParser('temp_file.png')
file_found = False
while True:
    
    # set previous chunk on first iterations before modulo may be returned
    if cur_chunk is not None and not writing_started:
        prev_chunk = cur_chunk
    
    # get current chunk
    cur_chunk = next(r, None)
    
    # skip first iteration
    if prev_chunk is None:
        continue
    
    # break loop if no data
    if not cur_chunk:
        break
    
    # concatenate two chunks to avoid the b' ... "fil', b'e": ... ' pattern
    two_chunks = prev_chunk + cur_chunk

    # if file key found get real base64 encoded data
    if not file_found and '"file":' in two_chunks.decode('ascii'):
        file_found = True

        # get part after "file" key
        two_chunks = two_chunks.decode('ascii').split('"file":')[1].encode('ascii')
        
    if file_found and not writing_started:
        # data should be started after first "-quote
        # so cut all data before "
        if '"' in (t := two_chunks.decode('ascii')):
            two_chunks = t[t.find('"')+1:].encode('ascii')
            writing_started = True
        # handle b' ... "file":', b'"... ' patern
        else:
            cur_chunk = b''
            continue

    # check for last data chunk
    # "-quote means end of value
    if writing_started and '"' in (t := two_chunks.decode('ascii')):
        two_chunks = t[:t.find('"')].encode('ascii')
        last_chunk = True

    if writing_started:

        # decode and write data in file
        prev_chunk = parser.add_chunk(two_chunks)

        # end operation
        if last_chunk:
            if (l := len(prev_chunk)) > 0:
                # if the last remainder length is larger than 0, the total data length is not a multiple of 4
                # probably some data was lost?
                raise ValueError(f'Bad end of data. length is {str(l)} and last characters are {prev_chunk.decode("ascii")}')
            break

parser.close()
print(parser.total_length)

Don't forget to compare files after download when testing this script:

# get md5 of downloaded by chunks file
$ md5 temp_file.png
MD5 (temp_file.png) = 806165d96d5f9a25cebd2778ae4a3da2
# get md5 of downloaded file using browser
$ md5 PNG_demo_Banana.png
MD5 (PNG_demo_Banana.png) = 806165d96d5f9a25cebd2778ae4a3da2
rzlvmp
  • 7,512
  • 5
  • 16
  • 45
1

You could stream it down to a file like this (pip install base64io):

class decoder():
    def __init__(self, fh):
        self.fileh = open(fh, 'rb')
        self.closed = False
        search = ''
        start_tag = '"file": "'
        for i in range(1024):
            search += self.fileh.read(1).decode('UTF8')
            if len(start_tag) > len(search)+1:
                continue
            if search[-len(start_tag):] == start_tag:
                break

    def read(self, chunk=1200):
        data = self.fileh.read(chunk)
        if not data:
            self.close()
            return b''
        return data if not data.decode('UTF8').endswith('"}}') else data[:-3]

    def close(self):
        self.fileh.close()
        self.closed = True

    def closed(self):
        return self.closed

    def flush(self):
        pass

    def write(self):
        pass

    def readable(self):
        return True

And then use the class like this:

from base64io import Base64IO
encoded_source = decoder(fh)
with open("target_file.jpg", "wb") as target, Base64IO(encoded_source) as source:
    for line in source:
        target.write(line)

But of course you need to change it from streaming from a local file to streaming from the requests raw object.
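
For example, a rough, untested sketch of that change, reusing the decoder class above but reading from the raw response instead of a local file (my_url and my_data are the names from the question):

import requests
from base64io import Base64IO

class raw_decoder(decoder):
    def __init__(self, raw):
        # same search for the start tag as above, but on the urllib3 raw stream
        self.fileh = raw
        self.closed = False
        search = ''
        start_tag = '"file": "'
        for i in range(1024):
            search += self.fileh.read(1).decode('UTF8')
            if search.endswith(start_tag):
                break

r = requests.post(url=my_url, json=my_data, stream=True)
r.raw.decode_content = True  # let urllib3 undo any content-encoding first

encoded_source = raw_decoder(r.raw)
with open("target_file.jpg", "wb") as target, Base64IO(encoded_source) as source:
    for line in source:
        target.write(line)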

delica
  • 1,647
  • 13
  • 17