
I am having a little problem with the answer given at Python progress bar and downloads.

If the downloaded data is gzip-encoded, the Content-Length header and the total length of the data accumulated in the for data in response.iter_content(): loop differ: the accumulated data is bigger, because requests automatically decompresses gzip-encoded responses.

So the bar gets longer and longer, and once it becomes too long for a single line, it starts flooding the terminal.

A working example of the problem (the site is the first one I found on Google that serves both a Content-Length header and gzip encoding):

import sys

import requests

def test(link):
    print("starting")
    response = requests.get(link, stream=True)
    total_length = response.headers.get('content-length')
    if total_length is None:  # no Content-Length header
        data = response.content
    else:
        dl = 0
        data = b""
        total_length = int(total_length)
        for byte in response.iter_content():
            dl += len(byte)
            data += byte
            done = int(50 * dl / total_length)
            sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
            sys.stdout.flush()
    print("total data size: %s,  content length: %s" % (len(data), total_length))

test("http://www.pontikis.net/")

PS: I am on Linux, but this should affect other OSes too (except maybe Windows, since \r doesn't work there IIRC).

Also, I am using requests.Session for cookie (and gzip) handling, so a solution using urllib or some other module isn't what I am looking for.
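The size mismatch itself can be reproduced offline with the standard gzip module (the payload here is made up, purely to illustrate why Content-Length undercounts the decompressed data):

```python
import gzip

# A compressible payload standing in for an HTML page (illustrative only).
payload = b"hello world " * 1000

# This is roughly what travels over the wire, and what Content-Length describes.
compressed = gzip.compress(payload)

# requests hands back the decompressed bytes, so the accumulated length
# ends up larger than the advertised Content-Length.
assert len(compressed) < len(payload)
assert gzip.decompress(compressed) == payload
```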

freeforall tousez
  • If your problem is that requests automatically decompresses the data, maybe you shouldn't use requests. Since you're not doing any authentication, the standard urllib.request should probably be fine. Then you can retrieve the data with a working progress bar and decompress it with the zlib module once the file is completely downloaded. – Kritzefitz Feb 14 '14 at 12:59
  • I need to persist the cookies after logging in with a POST request for what I am doing, which is why I mentioned requests.Session, and also why I said a solution with urllib isn't what I am looking for. – freeforall tousez Feb 14 '14 at 13:15
  • Oh sorry. Didn't see that at the end. – Kritzefitz Feb 14 '14 at 13:28
  • You can use `response.raw` to access the raw socket response from the server, without all the handling that `iter_content` performs. – Michael Foukarakis Feb 19 '14 at 14:18

3 Answers


Perhaps you should try disabling gzip compression or otherwise accounting for it.

The way to turn it off for requests (when using a session as you say you are):

import requests

s = requests.Session()
del s.headers['Accept-Encoding']

With the header removed, requests no longer advertises gzip support (the underlying stack may send Accept-Encoding: identity instead), and the server should not compress the response. If instead you're downloading a file that is itself a gzip archive, you shouldn't run into this problem: you'll receive a Content-Type such as application/x-gzip-compressed. When the website transparently compresses its pages, you'll receive, for example, a Content-Type of text/html together with a Content-Encoding of gzip.

If the server always serves compressed content regardless, you're out of luck, but no well-behaved server should do that.
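You can confirm the header really disappears without touching the network, by preparing a request instead of sending it (a sketch; the URL from the question is just a stand-in):

```python
import requests

s = requests.Session()
# requests installs a default Accept-Encoding header (gzip, deflate, ...).
assert 'Accept-Encoding' in s.headers

del s.headers['Accept-Encoding']

# Prepare (but don't send) a request to inspect the headers that would go out.
prepared = s.prepare_request(requests.Request('GET', 'http://www.pontikis.net/'))
assert 'Accept-Encoding' not in prepared.headers
```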


If instead you want to do the same thing with the functional API of requests:

import requests

r = requests.get('url', headers={'Accept-Encoding': None})

Setting the header value to None via the functional API (or even in a call to session.get) removes that header from the request.
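The None behaviour can likewise be checked network-free by preparing the request through a session (sketch; placeholder URL again):

```python
import requests

s = requests.Session()
req = requests.Request('GET', 'http://www.pontikis.net/',
                       headers={'Accept-Encoding': None})
prepared = s.prepare_request(req)

# The merge of session defaults and per-request headers drops None-valued keys.
assert 'Accept-Encoding' not in prepared.headers
```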

Ian Stapleton Cordasco
  • This would fix the problem, but it would create a bigger bottleneck; not only that, I don't think you can just use del on a plain requests.get to remove that header. – freeforall tousez Feb 19 '14 at 00:47
  • @freeforalltousez Unless you're downloading gigabyte-long webpages, this shouldn't cause too much trouble. Furthermore, you asked how to do it with a session, but I've updated my answer with how to do the exact same thing with `requests.get`. If your question is no longer up to date with your requirements, update it. – Ian Stapleton Cordasco Feb 19 '14 at 12:38
  • I guess I will mark this answer for now, until a better one appears. – freeforall tousez Feb 19 '14 at 18:13

You could replace...

dl += len(byte)

...with:

dl = response.raw.tell()

From the documentation:

tell(): Obtain the number of bytes pulled over the wire so far. May differ from the amount of content returned by HTTPResponse.read() if bytes are encoded on the wire (e.g., compressed).
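A minimal sketch of how that slots into the download loop from the question; the bar-rendering helper is factored out here for clarity, and `download` is a hypothetical wrapper, not part of the original code:

```python
import sys


def render_bar(done_bytes, total_bytes, width=50):
    """Format a fixed-width progress bar from wire-level byte counts."""
    done = int(width * done_bytes / total_bytes)
    return "[%s%s]" % ("=" * done, " " * (width - done))


def download(link):
    import requests  # third-party; imported lazily so the helper stays importable

    response = requests.get(link, stream=True)
    total_length = int(response.headers["content-length"])
    data = b""
    for chunk in response.iter_content(chunk_size=1024):
        data += chunk
        # raw.tell() counts compressed bytes off the wire, so it stays
        # comparable to Content-Length even for gzip-encoded responses.
        sys.stdout.write("\r" + render_bar(response.raw.tell(), total_length))
        sys.stdout.flush()
    return data
```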

Nehal J Wani

Here is a simple progress bar implemented with tqdm:

import gzip

from tqdm import tqdm


def _reader_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)


def raw_newline_count_gzip(fname):
    """Count the lines in a gzip file by streaming it in 1 MiB chunks."""
    with gzip.open(fname, 'rb') as f:
        return sum(buf.count(b'\n') for buf in _reader_generator(f.read))


num = raw_newline_count_gzip(fname)
with gzip.open(fname, 'rb') as f, tqdm(total=num) as pbar:
    for line in f:
        # do whatever you want with each line
        pbar.update(1)

The bar looks like: 35%|███▌ | 26288/74418 [00:05<00:09, 5089.45it/s]

Diya Li