
Memory usage keeps growing while downloading files with a multi-threaded Queue; the requests data seems to be kept in memory and never released, which looks very weird. Does anyone know why?

Here is my source code:

import threading
from queue import Queue
import requests
import shutil

class MultiThreadsModule():
    def __init__(self, thread_num):
        self.print_lock = threading.Lock()
        self.compress_queue = Queue()
        self.thread_num = thread_num
        self.all_nodes = None
        self.func_main = None

    def thread_loop(self):
        self.thread_list = []
        for _ in range(self.thread_num):
            t = threading.Thread(target=self.process_queue)
            t.daemon = True
            t.start()
            self.thread_list.append(t)

    def node_queue(self, nodes):
        for node in nodes:
            self.compress_queue.put(node)
        self.compress_queue.join()

    def process_queue(self):
        while True:
            node = self.compress_queue.get()
            self.func_main(node)
            self.compress_queue.task_done()

    def run(self):
        self.node_queue(self.all_nodes)

def download_file_(url):
    r = requests.get(url, stream=True, timeout=600)
    return r.text

if __name__ == '__main__':
    mtm = MultiThreadsModule(20)
    mtm.all_nodes = ["https://www.sec.gov/Archives/edgar/data/913951/000095013399003276/0000950133-99-003276-d2.pdf"] * 1000
    mtm.func_main = download_file_
    mtm.thread_loop()
    mtm.run()

My memory keeps growing with the size of the downloaded PDFs. When I shut down the download script, memory returns to normal, as it was at the start.

Here is how my memory usage changed over time:

(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7501440     6342032      207336     2187428     8000996
Swap:       2097148       66828     2030320
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7499512     6344080      207124     2187308     8003148
Swap:       2097148       66828     2030320
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7484624     6304240      202692     2242036     8022128
Swap:       2097148       66828     2030320
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7482960     6305724      202692     2242216     8023788
Swap:       2097148       66828     2030320
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7559828     6210116      216200     2260956     7933424
Swap:       2097148       66828     2030320
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7559700     6204536      217868     2266664     7931840
Swap:       2097148       66828     2030320
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7637356     6127720      212544     2265824     7859580
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     7871248     5816500      262944     2343152     7575232
Swap:       2097148       66828     2030320
(base) jay@ubuntu:~$ free 
              total        used        free      shared  buff/cache   available
Mem:       16030900     8412848     5193552      252832     2424500     7042864
Swap:       2097148       66828     2030320

The weirdest part is that if I switch to a file link from another website, everything is normal and my memory never grows as in the issue above. The other link: sse.com.cn/disclosure/listedinfo/announcement/c/2019-12-27/… .

(base) jay@ubuntu:~$ curl -I https://www.sec.gov/Archives/edgar/data/913951/000095013399003276/0000950133-99-003276-d2.pdf
HTTP/1.1 200 OK
Date: Thursday, 16-Jan-20 12:07:34 CST
Keep-Alive: timeout=58
Content-Length: 0

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Type: application/pdf
ETag: "9a2245050c11f5db74ed734eb31b31a0"
Last-Modified: Mon, 02 Oct 2017 21:50:29 GMT
Server: AmazonS3
x-amz-id-2: BwWoaQxPxWSoKT3cJz2fpFLf9j53sdO20m4IedR9I5ZJNBHIFyH4AuqiN9HRx45sSdw/NmhkAjs=
x-amz-meta-mode: 33188
x-amz-replication-status: REPLICA
x-amz-request-id: 33A22AD1A6F87DE0
x-amz-version-id: lVtscFRHVvquEIo8.Q7sUnmAO1nQkKm7
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Length: 1741924
Date: Thu, 16 Jan 2020 04:07:35 GMT
Connection: keep-alive
Strict-Transport-Security: max-age=31536000 ; includeSubDomains ; preload

(base) jay@ubuntu:~$ curl -I http://www.sse.com.cn/disclosure/listedinfo/announcement/c/2019-12-27/603818_2018_nA.pdf
HTTP/1.1 200 OK
Content-Length: 3899122
Accept-Ranges: bytes
Age: 588
Content-Type: application/pdf
Date: Thu, 16 Jan 2020 04:08:09 GMT
Etag: "WAa9c77cbc49bb90b3"
Keep-Alive: timeout=58
Last-Modified: Thu, 26 Dec 2019 09:35:22 GMT
Server: Apache
X-Wa-Info: [V2.S11101.A12708.P79382.N26848.RN0.U4201449325].[OT/pdf.OG/documents]

The memory keeps being consumed until the system can't allocate any more.

Jay
  • First of all, if you're planning to do a sectioned download of the same file, you need to specify the byte range you want as a header. Right now you're just downloading the same file from the start over and over. Run ```curl -I http://file``` to see if "Accept-Ranges" is supported. Secondly, you're just downloading the data and not iterating over or saving it. Take a look at [this](https://stackoverflow.com/a/16696317/7764138) (a sketch of both ideas follows these comments). – Xosrov Jan 15 '20 at 11:21
  • @Xosrov Thanks for the reply, the repeated file link is just for testing. – Jay Jan 16 '20 at 03:58
  • @Xosrov The weirdest part is that if I switch to a file link from another website, everything is normal and my memory never grows as in the issue above. The other link: http://www.sse.com.cn/disclosure/listedinfo/announcement/c/2019-12-27/603818_2018_nA.pdf. I've racked my brains and still don't know why. Any other suggestions? – Jay Jan 16 '20 at 04:02
  • Take a look at this answer: https://stackoverflow.com/a/33777090/2627487 – MrPisarik Jan 17 '20 at 18:08
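
A minimal sketch of the ranged, streamed download suggested in the first comment, assuming the server honours "Accept-Ranges: bytes" (the sec.gov curl output below advertises it); the chunk size and output file name are just examples:

```
import requests

url = "https://www.sec.gov/Archives/edgar/data/913951/000095013399003276/0000950133-99-003276-d2.pdf"

def download_range(url, start, end, out_path):
    # Ask the server for only the bytes from `start` to `end` (inclusive).
    headers = {"Range": "bytes=%d-%d" % (start, end)}
    with requests.get(url, headers=headers, stream=True, timeout=600) as r:
        # 206 Partial Content means the range was honoured; a plain 200 means
        # the server ignored the Range header and sent the whole file.
        r.raise_for_status()
        with open(out_path, "wb") as f:
            # Stream in small chunks so a section is never held in memory whole.
            for chunk in r.iter_content(chunk_size=64 * 1024):
                f.write(chunk)

# Fetch the first megabyte of the PDF as one section.
download_range(url, 0, 1024 * 1024 - 1, "section_0.part")
```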

2 Answers


Maybe you can change your code like this; the memory will not grow too much.

import requests
import shutil
import threading
from queue import Queue, Empty
from time import sleep

class MultiThreadsModule(object):

    def __init__(self, thread_num):
        self.print_lock = threading.Lock()
        self.compress_queue = Queue()
        self.thread_num = thread_num
        self.all_nodes = None
        self.func_main = None

    def thread_loop(self):
        self.thread_list = []
        for _ in range(self.thread_num):
            t = threading.Thread(target=self.process_queue)
            t.daemon = True
            t.start()
            self.thread_list.append(t)

    def node_queue(self, nodes):
        for i, node in enumerate(nodes):
            self.compress_queue.put(node)
        self.compress_queue.join()

    def process_queue(self):
        while True:
            try:
                # Non-blocking get; back off for a second when the queue is empty.
                node = self.compress_queue.get_nowait()
            except Empty:
                sleep(1)
                continue
            self.func_main(node)
            self.compress_queue.task_done()

    def run(self):
        self.node_queue(self.all_nodes)

def download_file_(url):
    # The with-block closes the response (and releases its connection buffers)
    # as soon as the function returns, instead of waiting for garbage collection.
    with requests.get(url, stream=True, timeout=600) as r:
        return r.text

if __name__ == '__main__':
    mtm = MultiThreadsModule(20)
    mtm.all_nodes = ["https://www.sec.gov/Archives/edgar/data/913951/000095013399003276/0000950133-99-003276-d2.pdf"] * 1000
    mtm.func_main = download_file_
    mtm.thread_loop()
    mtm.run()
leafcoder

There are a couple of possible causes. First:

return r.text

Since you are requesting a binary file, you'd better use r.content. For r.text, requests will try to guess the encoding of the response and then decode it, which can take tons of CPU cycles and RAM. In my case, the unmodified code got stuck at r.text for quite a while before returning, and that is likely the root cause of your issue.
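
A minimal sketch of that change combined with the stream-to-disk pattern from the answer linked in the comments, so the body is never decoded to text and never held fully in memory (the output file name is just an example):

```
import shutil
import requests

def download_file_(url, out_path="out.pdf"):
    # Stream the response and copy the raw bytes straight to disk.
    with requests.get(url, stream=True, timeout=600) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            shutil.copyfileobj(r.raw, f)
```

If you only need the bytes in memory, returning r.content instead of r.text already skips the encoding guess and the oversized decoded str.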

After changing r.text to r.content, the `\time` command shows my memory usage as:

1000 requests
    89055232  maximum resident set size

100 requests
    76300288  maximum resident set size

10 requests
    47607808  maximum resident set size

1 request
    25714688  maximum resident set size

The remaining difference might come from mtm.all_nodes and node_queue, which keep all of the URLs in memory. The reason you didn't hit this issue with the "other url" might be that it returns a plain-text response, which is easier for requests to decode.
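
If the queue itself turns out to matter, one way to keep it small is to bound it, so put() blocks instead of buffering every URL up front. A minimal sketch of just the constructor change, assuming the rest of the class stays as in the question (the maxsize value is arbitrary):

```
import threading
from queue import Queue

class MultiThreadsModule(object):
    def __init__(self, thread_num):
        self.print_lock = threading.Lock()
        # Bounded queue: put() blocks once roughly two items per worker are
        # waiting, so node_queue() never holds all of the URLs at once.
        self.compress_queue = Queue(maxsize=thread_num * 2)
        self.thread_num = thread_num
        self.all_nodes = None
        self.func_main = None
```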

Mayli
  • If the function `download_file_` returns nothing, the memory still grows like before. I think the real question is why the GC didn't collect the memory. – Jay Jan 20 '20 at 01:22
  • Well, the growth in memory usage is almost the same even if you replace download_file_ with a `pass`. The `return r.text` loads the binary into memory and tries to decode it during execution. Keep monitoring the CPU usage and you will find the process busy decoding the binary into an even bigger str (a quick way to see this is sketched below). – Mayli Jan 20 '20 at 20:15
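
A quick way to see the effect described in the last comment (illustrative only; the exact numbers depend on the response and on the guessed encoding):

```
import sys
import requests

url = "https://www.sec.gov/Archives/edgar/data/913951/000095013399003276/0000950133-99-003276-d2.pdf"
r = requests.get(url, timeout=600)

# r.content is the raw bytes; r.text guesses an encoding and decodes the whole
# body into a str, which for binary data is typically several times larger.
print("bytes object:", sys.getsizeof(r.content))
print("decoded str :", sys.getsizeof(r.text))
```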