
I have been trying to download a zipped csv using the requests library from a server host URL.

When I download a smaller, uncompressed file from the same server, pandas reads the CSV without problems, but with this one I get encoding errors.

I have tried multiple encodings, reading the response directly with pandas read_csv, and opening it as a zip file (at which point I get the error that the file is not a zip file).

I have additionally tried using the zipfile library as suggested here: Reading csv zipped files in python

and have also tried setting both encoding and compression in read_csv.

The code which works for the non-zipped server file is below:

response = requests.get(url, auth=HTTPBasicAuth(un, pw), stream=True, verify=False)
dfs = pd.read_csv(response.raw)

but when used for this file it raises: 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

I have also tried:

request = get(url, auth=HTTPBasicAuth(un, pw), stream=True, verify=False)
zip_file = ZipFile(BytesIO(request.content))
files = zip_file.namelist()
with gzip.open(files[0], 'rb') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)

which returns a seek attribute error.

visualnotsobasic
  • https://stackoverflow.com/questions/39838026/pandas-read-csv-method-supports-zip-archive-reading-but-not-to-csv-method-su/71943718#71943718 This answer helped me; I hope it helps you. – Raul Maya Apr 20 '22 at 17:21

1 Answer


Here is one way to do it:

import pandas as pd
import requests
from requests.auth import HTTPBasicAuth
from zipfile import ZipFile
import io

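# Example dataset; un and pw are your own credentials, as in the question
url = 'https://www.stats.govt.nz/assets/Uploads/Retail-trade-survey/Retail-trade-survey-September-2020-quarter/Download-data/retail-trade-survey-september-2020-quarter-csv.zip'

# Note: verify=False disables TLS certificate checking
response = requests.get(url, auth=HTTPBasicAuth(un, pw), stream=True, verify=False)
response.raise_for_status()  # stop here if the server returned an HTTP error

# Wrap the downloaded bytes in a file-like object so ZipFile can seek in it
with ZipFile(io.BytesIO(response.content)) as myzip:
    # Open the first member of the archive and hand it to pandas
    with myzip.open(myzip.namelist()[0]) as myfile:
        df = pd.read_csv(myfile)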

print(df)

If you want to read a specific csv in a multiple-csv zip file, replace myzip.namelist()[0] with the name of the file you want to read. If you don't know its name, you can list the archive's contents with print(ZipFile(io.BytesIO(response.content)).namelist())
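
For a multiple-csv archive, a minimal sketch of finding the candidate names could look like the following. The helper name list_csv_members and the small in-memory archive are illustrative assumptions standing in for response.content; they are not part of the original answer.

```python
import io
from zipfile import ZipFile


def list_csv_members(zip_bytes):
    """Return the names of all .csv members in an in-memory zip archive."""
    with ZipFile(io.BytesIO(zip_bytes)) as myzip:
        return [name for name in myzip.namelist()
                if name.lower().endswith(".csv")]


# Small in-memory archive standing in for response.content (assumption)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("readme.txt", "not a csv")
    demo_zip.writestr("sales.csv", "x,y\n1,2\n")
    demo_zip.writestr("stock.csv", "x,y\n3,4\n")

csv_names = list_csv_members(buffer.getvalue())
print(csv_names)
```

Any of the returned names can then be passed to myzip.open(...) in place of myzip.namelist()[0].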

dedede
  • Thanks for this! So I tried this guy, but I get BadZipFile: File is not a zip file. The url extension does not say zip nor csv, just indicates that it is a servlet link – visualnotsobasic Feb 08 '21 at 20:50
  • can you share the link or at least something similar to it? – dedede Feb 08 '21 at 21:30
  • something along the lines of https://server.data.com/ReportServer-1/FileDownloadServlet?reportId=1234567 Not a real URL obviously but basically the exact structure - as you can see no file or folder extensions – visualnotsobasic Feb 08 '21 at 21:40
  • does the file start downloading automatically when you click on the url? In either case, one way to do this is to scrape the page and look for the .zip file then use that as the `url`. – dedede Feb 08 '21 at 21:54
  • It does; I tried that and there's nothing on the page with any sort of file extension attribute – visualnotsobasic Feb 08 '21 at 22:05
  • try and add a request header `headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}` then `requests.get(url, auth=HTTPBasicAuth(un, pw), stream=True, verify=False, headers=headers)` – dedede Feb 08 '21 at 22:25
  • sorry - read some documentation but not certain how 1) this helps and 2) what headers i should be passing – visualnotsobasic Feb 08 '21 at 23:26
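
When ZipFile raises BadZipFile, one way to narrow things down is to look at the first bytes of the payload before choosing a reader. The sketch below is only a diagnostic aid under stated assumptions: the helper name sniff_compression and the sample payloads are hypothetical, and the signature table covers just four common formats. For what it is worth, the byte 0xfd at position 0 reported in the question is consistent with the xz signature, but that is a guess; the server could equally be returning something else, such as an HTML error page.

```python
import gzip
import io
from zipfile import ZipFile

# Leading "magic" bytes of some common compressed formats
MAGIC_SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
}


def sniff_compression(payload):
    """Return the format suggested by the payload's first bytes, or None."""
    for magic, name in MAGIC_SIGNATURES.items():
        if payload.startswith(magic):
            return name
    return None


# Sample payloads standing in for response.content (assumptions)
zip_buffer = io.BytesIO()
with ZipFile(zip_buffer, "w") as demo_zip:
    demo_zip.writestr("data.csv", "x,y\n1,2\n")

zip_guess = sniff_compression(zip_buffer.getvalue())
gzip_guess = sniff_compression(gzip.compress(b"x,y\n1,2\n"))
xz_guess = sniff_compression(b"\xfd7zXZ\x00" + b"rest of stream")
plain_guess = sniff_compression(b"x,y\n1,2\n")
print(zip_guess, gzip_guess, xz_guess, plain_guess)
```

If the guess is one of these four names, it can be passed as the compression argument of pd.read_csv together with io.BytesIO(response.content). If it is None, printing response.content[:200] and response.headers usually shows what the server actually sent.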