
I have a Flask application that should provide an endpoint for downloading a large file. However, instead of serving it from the file system or generating it on the fly, the file first has to be downloaded from another server via HTTP.

Of course, I could first perform a GET request to the external server, download the file completely, store it in the file system or in memory, and then return it as the response to the original request. For example (basic authentication is included to show why a simple proxy on a lower layer is not sufficient):

#!flask/bin/python
from flask import Flask
import requests
from requests.auth import HTTPBasicAuth

app = Flask(__name__)

@app.route('/download')
def download():
    auth = HTTPBasicAuth("some_user", "some_password")
    session = requests.Session()
    session.auth = auth
    response = session.get("http://example.com")
    return response.content

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=1234, debug=True)

However, this increases both the latency and the storage requirements of the application. Moreover, even if the client only needs part of the file (i.e. it sends an HTTP range request), the whole file first has to be fetched from the external server.

Is there a more elegant way to solve this, i.e. to support HTTP range requests that are forwarded directly to the external server?

koalo
  • Have you looked at [StringIO](https://docs.python.org/3/library/io.html#text-i-o) ? – IMCoins Jun 24 '20 at 10:36
  • @koalo PATCH could be used to resume uploads with a Content-Range header. – Shaka Flex Jun 24 '20 at 10:51
  • @IMCoins Yes, I am aware of StringIO, but I have admittedly no idea how to put the pieces together. Can I just return a StringIO object in Flask and then it automatically supports partial downloads? – koalo Jun 25 '20 at 08:53
  • @ShakaFlex Isn't PATCH for partial modifications (client to server) and not partial downloads (server to client)? – koalo Jun 25 '20 at 08:54
  • @koalo client to server, resuming a download can happen in both directions, using different headers for each direction. Client resume [Resuming the HTTP Download of a File](https://www.oreilly.com/library/view/python-cookbook/0596001673/ch11s06.html) , Server resume [The PATCH Method](https://tools.ietf.org/id/draft-dusseault-http-patch-16.html#RFC2616) – Shaka Flex Jun 25 '20 at 11:05
  • @koalo another way that may be easier to implement would be to turn the intermediary server into an HTTP proxy, letting the client browser and the end server do the rest of the work. The only resource cost for the intermediary would be mostly network. – Shaka Flex Jun 25 '20 at 11:39
  • @koalo I'm not sure I understand what you mean when you say "partial downloads". Are you trying to know how to return a file from Flask ? – IMCoins Jun 25 '20 at 14:16
  • With partial downloads I mean range request, i.e. HTTP GET requests with the Range header set. – koalo Jun 25 '20 at 17:18
  • @ShakaFlex A simple proxy on the network layer would not be possible since for example the authorization has to be different. – koalo Jun 28 '20 at 05:37
  • @koalo I think we need a [mcve] to help you. You have a client, which will request you something. In order to give him the requested information, you also need to make another request. We need you to post a sample of what would the data look like, and what you would like to return with your API, and in which format. The format could be anything, from a file to download as-is, to a JSON content. I believe the answer lies in the StringIO/BytesIO I pointed out earlier, but if you have trouble formatting it, we need to know where you're having trouble. :) – IMCoins Jun 29 '20 at 08:36
  • I have added an example and hope it helps. – koalo Jun 29 '20 at 09:19
  • @koalo What does `response.content` look like ? – IMCoins Jun 29 '20 at 09:53
  • @IMCoins response.content is arbitrary binary data. Can be for example the content of a ZIP file. – koalo Jun 29 '20 at 10:18
  • Use nginx or apache as a reverse proxy, it's better suited. You can implement http auth on the proxy. Syncing the flask auth and proxy auth is easier than accomplishing this with flask. – Tohmaxxx Jun 30 '20 at 14:24

1 Answer

Based on the questions "Proxying to another web service with Flask", "Download large file in python with requests" and "Flask large file download", I managed to build a Flask HTTP proxy that works in stream mode.

from flask import Flask, request, Response
import requests

PROXY_URL = 'http://ipv4.download.thinkbroadband.com/'

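def download_file(streamable):
    # Relay the upstream body chunk by chunk; the context manager releases
    # the upstream connection once the body is exhausted.
    # No raise_for_status() here: _proxy() already forwards the upstream
    # status code, and raising mid-stream can abort the client response.
    with streamable as stream:
        for chunk in stream.iter_content(chunk_size=8192):
            yield chunk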


def _proxy(*args, **kwargs):
    resp = requests.request(
        method=request.method,
        url=request.url.replace(request.host_url, PROXY_URL),
        headers={key: value for (key, value) in request.headers if key != 'Host'},
        data=request.get_data(),
        cookies=request.cookies,
        allow_redirects=False,
        stream=True)

    excluded_headers = ['content-encoding', 'content-length', 'transfer-encoding', 'connection']
    headers = [(name, value) for (name, value) in resp.raw.headers.items()
               if name.lower() not in excluded_headers]

    return Response(download_file(resp), resp.status_code, headers)


app = Flask(__name__)

@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def download(path):
    return _proxy()

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=1234, debug=True)

download_file() iterates over the streamed upstream response and yields each chunk as soon as it arrives.

_proxy() creates the upstream request with stream=True, then builds and returns a Flask Response that uses the download_file() generator as its body.
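
For the use case in the question, where the upstream credentials differ from the client's, forwarding every client header may not be wanted. Here is a minimal sketch of a stricter variant, assuming that only the range-related request headers need to reach the upstream server. The names `build_upstream_headers`, `filter_response_headers` and both constants are hypothetical helpers, not part of Flask or requests:

```python
# Hypothetical helpers: they only decide which headers cross the proxy.

# Request headers that drive partial downloads
# (assumption: nothing else from the client is needed upstream).
FORWARDED_REQUEST_HEADERS = ("Range", "If-Range")

# Response headers that _proxy() drops, compared in lower case.
EXCLUDED_RESPONSE_HEADERS = frozenset(
    ("content-encoding", "content-length", "transfer-encoding", "connection")
)


def build_upstream_headers(client_headers):
    """Pick the range-related headers out of a mapping of client headers."""
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in client_headers.items()}
    return {
        name: lowered[name.lower()]
        for name in FORWARDED_REQUEST_HEADERS
        if name.lower() in lowered
    }


def filter_response_headers(upstream_headers):
    """Keep every (name, value) pair except the excluded ones."""
    return [
        (name, value)
        for name, value in upstream_headers
        if name.lower() not in EXCLUDED_RESPONSE_HEADERS
    ]


if __name__ == "__main__":
    print(build_upstream_headers(
        {"range": "bytes=0-1023", "Authorization": "Basic abc"}))
    print(filter_response_headers(
        [("Content-Range", "bytes 0-1023/4096"), ("Connection", "keep-alive")]))
```

In `_proxy()`, the first result could be built from `dict(request.headers)` and passed as `headers=`, with the upstream credentials given separately through the `auth=` parameter of `requests.request()`. A `206 Partial Content` status and its `Content-Range` header would then pass through to the client, since neither is in the excluded set.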

I tested it with https://www.thinkbroadband.com/download, where several archive files are freely available for test purposes. (Be careful: the archives are corrupted, so you should use a checksum to make sure you got the expected file.)

Some examples:

curl 'http://0.0.0.0:1234/100MB.zip' --output /tmp/100MB.zip
curl 'http://0.0.0.0:1234/20MB.zip' --output /tmp/20MB.zip
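
To compare a file fetched through the proxy with the same file fetched directly, a checksum can be computed without loading the whole file into memory. This is a minimal sketch; `sha256_of_file` is a hypothetical helper name and SHA-256 is just one possible choice of hash:

```python
import hashlib


def sha256_of_file(path, chunk_size=8192):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of_file("/tmp/100MB.zip")` should match the digest of a copy downloaded straight from the origin server.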

I also ran some other tests against random websites to fetch large images. So far I have had no issues.

Arount
  • This already solves the problem of the range request, but still the whole request has to be executed before the response is returned. – koalo Jun 30 '20 at 15:24
  • @koalo Sorry for that, I fixed my answer and it should be better now – Arount Jul 01 '20 at 11:24
  • I was getting connection reset issues, at least with a 404 error return, but used this: `return Response(stream_with_context(resp.iter_content(chunk_size=1024)), resp.status_code, headers)` and the issues went away – Charles L. Nov 15 '22 at 00:47