
I have written a pausable multi-threaded downloader using requests and threading, but the downloads just can't complete after resuming. Long story short: due to special network conditions, the connections can often die during downloads, requiring the connections to be refreshed.

You can view the code here in my previous question:

Python multi connection downloader resuming after pausing makes download run endlessly

I observed that the downloads can go beyond 100% after resuming and won't stop (at least I haven't seen them stop), mmap indexes go out of bounds, and I get lots of error messages...

I have finally figured out that this is caused by the ghost of a previous request: the server keeps sending extra data from the last connection that was never consumed before the pause.

This is my solution:

  • create a new connection
s = requests.session()
r = s.get(
    url, headers={'connection': 'close', 'range': 'bytes={0}-{1}'.format(start, end)}, stream=True)
  • interrupt the connection
r.close()
s.close()
del r
del s

In my testing, I have found that requests has two attributes named session: one Titlecase, one lowercase. The lowercase one is a function and the Titlecase one is the class itself, but both produce a requests.sessions.Session object. Is there any difference between them?
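A quick check confirms they produce the same type (the lowercase session is just a helper function that instantiates the class):

```python
import requests

s1 = requests.session()  # lowercase: a module-level helper function
s2 = requests.Session()  # Titlecase: the class itself
# the helper simply instantiates the class, so both are the same type
print(type(s1) is type(s2) is requests.sessions.Session)  # True
s1.close()
s2.close()
```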

And how can I set keep-alive to False?

The method found here won't work anymore:

In [39]: s = requests.session()
    ...: s.config['keep_alive'] = False
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-497f569a91ba> in <module>
      1 s = requests.session()
----> 2 s.config['keep_alive'] = False

AttributeError: 'Session' object has no attribute 'config'

This method from here doesn't throw errors:

s = requests.session()
s.keep_alive = False

But I seriously doubt that it has any effect at all; it just adds a new boolean attribute to the object, and I don't think anything in the object ever reads it.
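For comparison, the only approach I know of that actually travels on the wire is setting the header myself, as suggested in the comments (assuming the server honors Connection: close); a minimal sketch:

```python
import requests

s = requests.Session()
# this header is sent with every request made through this session;
# a compliant server should close the TCP connection after each response
s.headers['Connection'] = 'close'
print(s.headers['Connection'])  # close
```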

I have seen a close method on requests.models.Response. Does it have any effect in this case, or can I just close the session?

And finally, using this method, is it guaranteed that the server will never send extra bytes from previous dead connections?

Ξένη Γήινος
    You can set `Connection: close` in headers or use HTTP 1.0 – Olvin Roght Aug 14 '21 at 12:08
  • If the server is buggy, nothing you can do will *guarantee* that it won't continue to exhibit buggy behavior. – larsks Aug 14 '21 at 12:50
  • What Is the question you want answered? I count four question marks in the post. –  Dec 05 '21 at 01:10
  • @ThorSummoner, which question would you like answered? –  Dec 05 '21 at 01:12
  • Isn't one of the features of a `requests.Session` instance the reusing of the TCP connection? So do a `requests.get(url, ...)` instead and forget about using a `Session` instance. – Booboo Dec 05 '21 at 11:44
  • @Strom I got here from Google looking for "how to make requests use a new connection", I want to know that. On related SO answers, some suggested closing the current `request.connection.close()` or `response.connection.close()`, or the specific `session` connection. After a general answer, I am also curious to know if anything special must be done when requests is using a connection pool. More concretely, I think I'm having issues where some of the connections in a connection pool become stale and need to be reestablished; requests doesn't seem to perform that reestablishment, and I get intermittent TCP reset exceptions – ThorSummoner Dec 06 '21 at 05:52
  • Can you try with a class var instead for the session? ``` self.s = requests.session() ``` Not 100% sure with Python, but in other langs this would be better, as that would mean the object in memory gets replaced by the new one instead of making new ones in the function each time. Good because it reduces memory leaks, but also you will know your old session was destroyed. My feeling is you're passing around a stale session somewhere or it's stuck in a closure. – byteface Dec 08 '21 at 09:04
  • Do you really need pausing or just need to simulate them? – brunoff Dec 09 '21 at 19:16

2 Answers


In general, with Python, when there is some kind of 'handler' that is supposed to be closed after use, wrapping the use in a with block limits the scope of the object to a small section of code.

ResponseData = None
# Url and Headers are assumed to be defined earlier
with requests.get(Url, headers=Headers) as ResponseObject:
    ResponseData = ResponseObject.text

#Code below here has no idea what "ResponseObject" is.
#For some reason Python is able to clean it up more reliably after `with`.

Not sure if this is a canonical answer, but it might help you. The snippet works for me, though I avoid creating a session entirely. This trick has worked for me for countless other things that were supposed to close but did not.

EDIT: Following up on the session: I guess you can nest the two with blocks and see if it works?

with requests.session() as s:
    with s.get(....) as r:
        #try stuff here
D A

I guess that your problem isn't server related. The servers are probably behaving correctly, and the problem is the threads.

Considering the code from the related question, if it is up to date: when PAUSE is set to True, which happens during 50% of the time when the first argv argument is set to 1, dozens of threads are created every second (actually num_connections threads; the (pressed - lastpressed).total_seconds() > 0.5 and self.paused = not self.paused logic makes a new batch start every second). On Linux you would check this with top -H -p $pid, watch ps -T -p $pid, or watch ls /proc/$pid/task/ - you are probably using Windows, and there are Windows ways to check this.

Each batch of connections is correct when considered in isolation; the connection range is being correctly set in the headers. By sniffing the traffic yourself you'll see that they are just fine. The problem arises when new batches of threads arrive doing the same work. You get a lot of threads downloading similar ranges in different batches, giving you the same data. Since your writing logic is relative, not absolute, if two threads give you the same 123rd chunk, your self.position += len(chunk) will increase for both similar chunks, which can be a reason you go over 100%.
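One way to make the writes immune to duplicated chunks is to address the file absolutely, keyed to each request's own Range start, instead of advancing a shared position. A minimal sketch (write_chunk, range_start and bytes_done are hypothetical names, not from the question's code):

```python
def write_chunk(f, range_start, bytes_done, chunk):
    """Write chunk at its absolute offset inside this thread's range.

    If a stale thread delivers the same chunk twice, the second write
    lands on the same bytes instead of advancing a shared counter.
    """
    f.seek(range_start + bytes_done)
    f.write(chunk)
    return bytes_done + len(chunk)
```

Progress then becomes the sum of per-range byte counts, so a duplicated chunk no longer inflates the total.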

To test whether what I described happens, try downloading an ever-increasing file and check whether the saved file suffers from these double increments:

0000000000 00 00 00 00 00 00 00 01 00 00 00 02 00 00 00 03   ................
0000000010 00 00 00 04 00 00 00 05 00 00 00 06 00 00 00 07   ................
0000000020 00 00 00 08 00 00 00 09 00 00 00 0a 00 00 00 0b   ................
0000000030 00 00 00 0c 00 00 00 0d 00 00 00 0e 00 00 00 0f   ................

Or simulate a file range server yourself with something similar to this:

#!/usr/bin/env python3
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

hostname = "localhost"
serverport = 8081
filesizemegabytes = 8#.25
filesizebytes = int(filesizemegabytes*1024*1024)
filesizebytestr = str(filesizebytes)

class Server(BaseHTTPRequestHandler):
    def do_GET(self):
        self.do(True)
    def do_HEAD(self):
        self.do(False)
    def do(self, writebody=True):
        rangestr = self.headers.get('range')
        if type(rangestr) is str and rangestr.startswith('bytes='):
            self.send_response(206)
            rangestr = rangestr[6:]
            rangeint = tuple(int(i) for i in rangestr.split('-'))
            self.send_header('Content-Range', 'bytes '+rangestr+'/'+filesizebytestr)
        else:
            self.send_response(200)
            rangeint = (0, filesizebytes-1)
        self.send_header('Content-type', 'application/octet-stream')
        self.send_header('Accept-Ranges', 'bytes')
        # HTTP byte ranges are inclusive on both ends, hence the +1
        self.send_header('Content-Length', rangeint[1]-rangeint[0]+1)
        self.end_headers()
        if writebody:
            # write exactly one byte per position in the range: byte i of
            # the virtual file is byte i%4 of the big-endian counter i//4
            for i in range(rangeint[0], rangeint[1]+1):
                word = (i//4).to_bytes(4, byteorder='big')
                self.wfile.write(word[i%4:i%4+1])

if __name__ == '__main__':
    serverinstance = HTTPServer((hostname, serverport), Server)
    print("Server started http://%s:%s" % (hostname, serverport))
    try:
        serverinstance.serve_forever()
    except KeyboardInterrupt:
        pass
    serverinstance.server_close()

Considerations about resource usage

You don't need multithreading for multiple downloads. "Green" threads are enough, since you don't need more than one CPU; you just need to wait for I/O. Instead of multithread+requests, a more suitable solution would be asyncio+aiohttp (aiohttp because requests is not designed for async, although you will find some adaptations in the wild).
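A pure-stdlib sketch of that pattern (asyncio.sleep stands in for real network I/O, since aiohttp is a third-party package; fetch_range and its return value are illustrative, not a real HTTP call):

```python
import asyncio

async def fetch_range(start, end):
    # stands in for awaiting a ranged HTTP response over the network
    await asyncio.sleep(0.01)
    # stands in for the downloaded chunk of (end - start) bytes
    return bytes(end - start)

async def main():
    # four "connections" overlap on a single thread, driven by one event loop
    tasks = [fetch_range(i * 10, (i + 1) * 10) for i in range(4)]
    chunks = await asyncio.gather(*tasks)
    return b"".join(chunks)

data = asyncio.run(main())
print(len(data))  # 40
```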

Lastly, keep-alives are useful when you are planning to reconnect, which seems to be your case. Are your source and origin IPs:ports the same? You are trying to force connections closed, but once you realize the problem is not the servers, reanalyze your situation and see whether it is not better to keep connections alive.

brunoff