
I am reading McKinney's Data Analysis book, and he has shared a 150 MB file. Although this topic has been discussed extensively in "Progress Bar while download file over http with Requests", the code in the accepted answer throws an error for me. I am a beginner, so I am unable to resolve it.

I want to download the following file:

https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec/P00000001-ALL.csv

Here's the code without progress bar:

DATA_PATH='./Data'
filename = "P00000001-ALL.csv"
url_without_filename = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec"

url_with_filename = url_without_filename + "/" + filename
local_filename = DATA_PATH + '/' + filename

#Write the file on local disk
r = requests.get(url_with_filename)  #without streaming
with open(local_filename, 'w', encoding=r.encoding) as f:
    f.write(r.text)

This works well, but because there is no progress bar, I wonder what's going on.

Here's the code adapted from "Progress Bar while download file over http with Requests" and "How to download large file in python with requests.py?":

#Option 2:
#Write the file on local disk
r = requests.get(url_with_filename, stream=True)  # added stream parameter
total_size = int(r.headers.get('content-length', 0))

with open(local_filename, 'w', encoding=r.encoding) as f:
    #f.write(r.text)
    for chunk in tqdm(r.iter_content(1024), total=total_size, unit='B', unit_scale=True):
        if chunk:
            f.write(chunk)

There are two problems with the second option (i.e. with streaming and tqdm package):

a) The file size isn't calculated correctly. The actual size is 157MB, but the total_size turns out to be 25MB.

b) Even bigger problem than a) is that I get the following error:

  0%|          | 0.00/24.6M [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3265, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-31-abbe9270092b>", line 6, in <module>
    f.write(data)
TypeError: write() argument must be str, not bytes

As a beginner, I am unsure how to solve these two issues. I spent a lot of time going through the GitHub page of tqdm, but I couldn't follow it. I'd appreciate any help.


I am assuming that the readers know that we need to import requests and tqdm. So, I haven't included the code for importing these basic packages.


Here's the code for those who are curious:

import sys

r = requests.get(url_with_filename, stream=True)  # added stream parameter
# len(r.content) is used instead of the content-length header, which reported 25MB.
# Note: r.content reads the whole response into memory before the loop starts,
# so the bar below tracks the write to disk rather than the network transfer.
total_size = len(r.content)
downloaded = 0
chunk_size = 1024
with open(local_filename, 'wb') as f:  # binary mode, because the chunks are bytes
    for chunk in r.iter_content(chunk_size=chunk_size):
        downloaded += len(chunk)
        f.write(chunk)
        done = int(50 * downloaded / total_size)
        sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
        sys.stdout.flush()
watchtower

3 Answers


As the error says:

write() argument must be str, not bytes

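so decode each chunk to a string before writing it (str(chunk) would write the b'...' representation instead of the text):

f.write(chunk.decode(r.encoding))

Note: Alternatively, I would suggest writing the raw bytes to a .bin file and then converting it to .csv.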
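
The error comes down to the difference between `bytes` and `str`. As a small sketch (the sample `chunk` value is made up for illustration and UTF-8 is assumed), calling `str()` on a bytes object gives its repr rather than its text, whereas `decode()` or a binary-mode file preserves the data:

```python
import io

# Hypothetical sample standing in for one chunk from r.iter_content().
chunk = b"name,amount\nAlice,250\n"

# str() on bytes returns the repr: the b'...' wrapper plus escaped newlines.
as_repr = str(chunk)

# decode() turns the bytes into the text they encode (UTF-8 assumed here).
as_text = chunk.decode("utf-8")

# A binary-mode file accepts the bytes unchanged.
# io.BytesIO stands in for open(path, "wb") so nothing is written to disk.
buffer = io.BytesIO()
buffer.write(chunk)

print(repr(as_repr))
print(repr(as_text))
print(buffer.getvalue() == chunk)
```

Writing in binary mode avoids the question of which encoding to decode with, and it also avoids splitting a multi-byte character across two chunks.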

Marco D.G.

Try writing with wb instead of just w.

with open(local_filename, 'wb') as f:  # binary mode takes no encoding argument
    f.write(r.content)
Meghdeep Ray
  • I just checked. Both issues persist. Also, the issue isn't about plainly downloading the file, but about using a progress bar and calculating the size correctly. – watchtower Oct 12 '18 at 07:26

with open(filename, 'wb') as f:  # binary mode takes no encoding argument
    f.write(r.content)

This should fix your writing problem. Write r.content, not r.text, since type(r.content) is <class 'bytes'>, which is what you need to write to a file opened in binary mode.
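
For the progress bar itself, two details seem to matter. `tqdm` wrapped around `iter_content` counts one step per chunk rather than one per byte, and `Content-Length` reports the size on the wire, which is smaller than the decoded data when the server compresses the response (a plausible reason for 25MB versus 157MB, though that is an assumption about this server). A minimal sketch, with `copy_chunks` and `download` as hypothetical helper names:

```python
import io


def copy_chunks(chunks, out, on_progress=None):
    """Write byte chunks to a binary file object and report each chunk's size."""
    written = 0
    for chunk in chunks:
        if not chunk:  # skip empty keep-alive chunks
            continue
        out.write(chunk)
        written += len(chunk)
        if on_progress is not None:
            on_progress(len(chunk))
    return written


def download(url, path, chunk_size=64 * 1024):
    """Stream url to path with a tqdm bar (needs requests and tqdm installed)."""
    # Imported here so copy_chunks can be tried without the third-party packages.
    import requests
    from tqdm import tqdm

    # Asking for an uncompressed body keeps Content-Length comparable to the
    # bytes written; a server is free to ignore this request header.
    with requests.get(url, stream=True, headers={"Accept-Encoding": "identity"}) as r:
        r.raise_for_status()
        total = int(r.headers.get("content-length", 0)) or None
        with open(path, "wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
            # bar.update receives a byte count, so the bar advances in bytes.
            return copy_chunks(r.iter_content(chunk_size=chunk_size), f, bar.update)


# Offline check with made-up chunks; io.BytesIO stands in for the output file.
demo = io.BytesIO()
sizes = []
copy_chunks([b"ab", b"", b"cde"], demo, sizes.append)
```

Calling `download(url_with_filename, local_filename)` would then take the place of the loop in the question.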

HaR