Downloading csv files from a URL - trying to add file extensions before calling urlretrieve is causing pandas to raise BadZipFile

Question

I am trying to download multiple .csv files from a URL where the only variable is the year, which is appended to the end of a constant URL string. When I download manually from the website, the .csv file extension is added automatically and the file works with pandas without issues. However, I would like a function that automates the download, and I would prefer to keep the files compressed as .zip (or .gzip for Linux; I have already written a method for changing the extension) because of their size (20+ GB in total across all 23 files).

My issue is that I am currently building the save path (year_save_dir) with .csv.zip appended, because otherwise the downloaded file has no extension (it is also significantly smaller, so I am assuming it is meant to be a .zip file). When I try to read the .csv.zip file with pandas, it raises this error:

   File "C:\Python37\lib\zipfile.py", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

I understand why there is an error, but I am unsure how to fix it. If I download the files with no added extension in year_save_dir = it looks like this:

With .csv.zip it appears like this:

With the contents of this .csv.zip looking like (true size of ~1.75GB, reading this with pandas returns the above error):

When I add just .zip it compresses fine but then returns just a file with no extensions:

How can I download the files in compressed form (.zip in this case) while also giving the file inside the needed .csv extension? Is this possible, or is there a better method available? I am using Python 3.7, if that makes a difference.

Code:

import os
import urllib.request as ureq

import pandas as pd


EA_URL = ('https://environment.data.gov.uk/water-quality/batch/' +
                  'measurement?year=')

DEFAULT_YEARS = [x for x in range(2000, 2002)]


def csv_download(save_dir: str, years: list = DEFAULT_YEARS) -> None:
    for year in years:
        year = str(year)
        print("Started year {}".format(year))
        # Changing the below line is what is causing issues
        year_save_dir = save_dir + year + '.csv.zip'
        env_url = EA_URL + year
        ureq.urlretrieve(env_url, year_save_dir)
    return


def pandas_breaks(save_dir):
    # This function fails
    for file in os.listdir(save_dir):
        df = pd.read_csv(save_dir + file)


def main():
    save_dir = 'C:/Users/Acer/Downloads/Testing/'
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    csv_download(save_dir)
    pandas_breaks(save_dir)

if __name__ == '__main__':
    main()

Edit: I have experimented with an answer to How can I replace (or strip) an extension from a filename in Python?, but this also removes the .csv from the file inside the archive (when replacing .csv.zip with just .zip), so it appears that the .csv must be part of the saved file name. The inner file only gets a .csv extension when the saved name also contains it (.csv.zip). I printed the message returned by urlretrieve, which gives these headers:

x-amz-id-2: vUOH963SZ6x+NBjj02vFIFmpgzBPfxhIvZLSE+qcKcfeJzlfFwZQdq8OvWgazQeXrupowH9OxtI=
x-amz-request-id: YE1P1QBQMXGK6E9J
Date: Tue, 13 Sep 2022 18:21:35 GMT
Last-Modified: Tue, 13 Sep 2022 05:03:18 GMT
ETag: "74e84c081cb7fbe5fc0ad4850fc38d51-7"
Content-Encoding: gzip
Accept-Ranges: bytes
Content-Type: text/csv
Server: AmazonS3
Content-Length: 57932207
Connection: close

So the content appears to be gzip-compressed. I have no idea why it downloads correctly as a .csv file from the website, but comes compressed and with no extension when using Python. Does this extra information help at all?

score 0 · Accepted Answer · answered Sep 14 '22 at 15:03

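The edit to the question turned out to hold the key to downloading the .csv files without corruption and then decompressing them. The Content-Encoding: gzip header shows that the response body is gzip-compressed CSV rather than a zip archive, which is why the zipfile module (which pandas uses for names ending in .zip) raised BadZipFile. A browser decodes this encoding transparently, whereas urlretrieve saves the bytes exactly as they were sent. The answer to How to decompress a GZIP file to an uncompressed file on disk? by Charles Duffy was used to create a function that turns the .gzip files into uncompressed .csv files.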
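
One way to confirm what was actually downloaded, regardless of the name given to the file, is to look at its first bytes. The sketch below is only an illustration: sniff_compression is a hypothetical helper (not part of pandas or the standard library), and it only recognises the gzip signature and the signature of a non-empty zip archive.

```python
GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of every gzip stream
ZIP_MAGIC = b"PK\x03\x04"     # local file header of a non-empty zip archive


def sniff_compression(path):
    """Guess the format from the leading bytes, ignoring the file name."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

If this reports gzip for a file saved as .csv.zip, the name is misleading pandas into treating gzip data as a zip archive, which would be consistent with the BadZipFile error in the question.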

The output looks like this now and matches the sizes of the downloaded .csv files in another folder:

More testing is required to compare the downloaded files against the manually downloaded files to ensure the method is not corrupting/changing data but this is a solution. The working code:

import os
import gzip
import shutil
import urllib.request as ureq

import pandas as pd


EA_URL = ('https://environment.data.gov.uk/water-quality/batch/'
          'measurement?year=')

DEFAULT_YEARS = list(range(2000, 2002))


def csv_download(save_dir: str, years: list = DEFAULT_YEARS) -> None:
    for year in years:
        year = str(year)
        print("Started year {}".format(year))
        # The server sends gzip-encoded CSV, so save the raw bytes as .gzip
        gzip_save_dir = os.path.join(save_dir, year + '.gzip')
        csv_save_dir = os.path.join(save_dir, year + '.csv')
        env_url = EA_URL + year
        ureq.urlretrieve(env_url, gzip_save_dir)
        print("Started decompressing year {}".format(year))
        gzip_decompress(gzip_save_dir, csv_save_dir)


def pandas_breaks(save_dir):
    # Read only the first rows of each decompressed file as a quick check
    for file in os.listdir(save_dir):
        if file.endswith('.csv'):
            df = pd.read_csv(os.path.join(save_dir, file), nrows=10)
            print(df)


def gzip_decompress(gzip_name, csv_name):
    # Stream in binary mode so the whole file never has to fit in memory
    with gzip.open(gzip_name, 'rb') as f_in, open(csv_name, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)


def main():
    save_dir = 'C:/Users/Acer/Downloads/Testing/'
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    csv_download(save_dir)
    pandas_breaks(save_dir)

if __name__ == '__main__':
    main()
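
A possible variant, sketched here under assumptions rather than tested against the real site, is to decode the response while it is being downloaded so that no intermediate .gzip file is written. The names save_body and download_csv are hypothetical; the sketch assumes the server keeps reporting the compression through the Content-Encoding header shown in the question.

```python
import gzip
import shutil
import urllib.request as ureq


def save_body(stream, content_encoding, csv_name):
    """Write a binary stream to csv_name, undoing gzip encoding if present."""
    if (content_encoding or "").lower() == "gzip":
        # Wrap the stream so reads return decompressed bytes
        stream = gzip.GzipFile(fileobj=stream, mode="rb")
    with open(csv_name, "wb") as f_out:
        shutil.copyfileobj(stream, f_out)


def download_csv(url, csv_name):
    """Fetch url and store the decoded body as csv_name (assumed usage)."""
    with ureq.urlopen(url) as response:
        encoding = response.headers.get("Content-Encoding")
        save_body(response, encoding, csv_name)
```

In csv_download this would replace the urlretrieve and gzip_decompress pair with a single call such as download_csv(env_url, csv_save_dir). The trade-off is that the compressed copy is no longer kept on disk.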

I won't accept this answer for a few days in case anyone has any better solutions.
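
Since the question preferred keeping the files compressed, another option might be to leave the downloaded gzip data as it is and read it directly. The following is a minimal standard-library sketch; head_of_gzip_csv is a hypothetical helper, and the UTF-8 encoding is an assumption about the data rather than something the headers confirm.

```python
import csv
import gzip
from itertools import islice


def head_of_gzip_csv(gzip_name, nrows=10, encoding="utf-8"):
    """Return the header and the first nrows data rows of a gzip-compressed CSV."""
    # 'rt' opens the compressed file as text; nothing is written back to disk
    with gzip.open(gzip_name, "rt", newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = list(islice(reader, nrows))
    return header, rows
```

pandas can do the equivalent itself: pd.read_csv(path, compression='gzip', nrows=10) reads a gzip file whatever its name, and with the default compression='infer' a name ending in .gz (for example 2000.csv.gz) is detected automatically.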
