I am trying to download multiple .csv files from a URL where the only variable is the year, which is appended to the end of a constant URL string. When I download manually from the website in question, the .csv file extension is added automatically and there are no issues using the file with pandas. However, I would like a function that automates the download, and I would prefer to store the files compressed as .zip (or .gzip for Linux; I have already written a method for changing the extension) because of their size (20+ GB in total across all 23 files).
My issue is that at the moment I am building the save path (year_save_dir) with .csv.zip added at the end, because otherwise there is no extension on the downloaded file (and it is significantly smaller, so I am assuming it is meant to be a .zip file). When I try to read the file with the .csv.zip extension using pandas, it returns this error:
  File "C:\Python37\lib\zipfile.py", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
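One way to check what the downloaded file actually contains, regardless of the extension it was saved with, is to look at its first few bytes. The following is only a minimal sketch: the helper names sniff_compression and sniff_file are hypothetical, and it distinguishes nothing beyond the gzip and zip signatures.

```python
GZIP_MAGIC = b"\x1f\x8b"      # every gzip stream starts with these two bytes
ZIP_MAGIC = b"PK\x03\x04"     # local file header at the start of a non-empty zip


def sniff_compression(first_bytes: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    if first_bytes.startswith(GZIP_MAGIC):
        return "gzip"
    if first_bytes.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first four bytes of the file at ``path`` and classify them."""
    with open(path, "rb") as handle:
        return sniff_compression(handle.read(4))
```

If sniff_file reports "gzip" for a file that was saved as .csv.zip, that would be consistent with the BadZipFile error, since the zipfile module only understands zip archives.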
I understand why there is an error, but I am unsure how to fix it. If I download the files with no added extension in year_save_dir, it looks like this:
With .csv.zip it appears like this:
With the contents of this .csv.zip looking like this (true size of ~1.75 GB; reading this with pandas returns the above error):
When I add just .zip it compresses fine, but then it returns just a file with no extension:
How can I download the files with a compressed extension (.zip in this case) but also add the needed .csv extension to the file? Is this possible, or is there a better solution/method available? I am using Python 3.7, if that makes a difference.
Code:
import os
import urllib.request as ureq

import pandas as pd

EA_URL = ('https://environment.data.gov.uk/water-quality/batch/' +
          'measurement?year=')
DEFAULT_YEARS = [x for x in range(2000, 2002)]


def csv_download(save_dir: str, years: list = DEFAULT_YEARS) -> None:
    for year in years:
        year = str(year)
        print("Started year {}".format(year))
        # Changing the below line is what is causing issues
        year_save_dir = save_dir + year + '.csv.zip'
        env_url = EA_URL + year
        ureq.urlretrieve(env_url, year_save_dir)
    return


def pandas_breaks(save_dir):
    # This function fails
    for file in os.listdir(save_dir):
        df = pd.read_csv(save_dir + file)


def main():
    save_dir = 'C:/Users/Acer/Downloads/Testing/'
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    csv_download(save_dir)
    pandas_breaks(save_dir)


if __name__ == '__main__':
    main()
Edit: I have experimented with an answer to "How can I replace (or strip) an extension from a filename in Python?", but this removes the .csv from the zipped file too (when replacing .csv.zip with just .zip), so it appears that the .csv is essential in the file extension. It is not attached to the actual downloaded file unless it is also in the zipped file name (.csv.zip). I printed the message returned from urlretrieve and it gives this:
x-amz-id-2: vUOH963SZ6x+NBjj02vFIFmpgzBPfxhIvZLSE+qcKcfeJzlfFwZQdq8OvWgazQeXrupowH9OxtI=
x-amz-request-id: YE1P1QBQMXGK6E9J
Date: Tue, 13 Sep 2022 18:21:35 GMT
Last-Modified: Tue, 13 Sep 2022 05:03:18 GMT
ETag: "74e84c081cb7fbe5fc0ad4850fc38d51-7"
Content-Encoding: gzip
Accept-Ranges: bytes
Content-Type: text/csv
Server: AmazonS3
Content-Length: 57932207
Connection: close
So it appears to be a gzip file. I have no idea why it downloads correctly as a .csv file when I download it from the website, but comes with no extension and compressed when using Python. Does this extra information help at all?