
I am requesting a csv file that's gzipped.

How do I uncompress that file and convert it to a csv object?

csv_gz_file = get("example.com/filename.csv.gz", headers=csv_headers, timeout=30, stream=True)

reader = csv.reader(csv_gz_file)
for row in reader:
   print row

And it throws this because it's not unzipped

_csv.Error: line contains NULL byte
Bhargav Rao
Tim Nuwin

1 Answer

import csv
import gzip
import io
import requests

# csv_headers is the header dict from the question.
web_response = requests.get("https://example.com/filename.csv.gz", headers=csv_headers,
                            timeout=30, stream=True)
csv_gz_file = web_response.content # Content in bytes from requests.get
                                   # See comments below why this is used.

f = io.BytesIO(csv_gz_file)
with gzip.GzipFile(fileobj=f) as fh:
    # Passing a binary file to csv.reader works in PY2.
    # In PY3, wrap it first: io.TextIOWrapper(fh, 'utf8')
    reader = csv.reader(fh)
    for row in reader:
        print(row)

This keeps the gzipped data in memory, wraps it in an in-memory buffer (io.BytesIO), decompresses it with the gzip module, and finally hands the decompressed file object to your reader.

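csv.reader accepts any iterable that yields lines, so either a file handle or a list of lines should work, and I'd expect the above to be fine. If not, read the decompressed data first with csv_content = fh.read() and simply do: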

reader = csv.reader(csv_content.splitlines())

And that should do the trick.
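For Python 3, the splitlines fallback above needs one extra step, because the decompressed data is bytes and csv.reader expects text. The following is only a sketch under assumptions: the function name `rows_from_gzipped_csv_bytes` is hypothetical, UTF-8 is an assumed encoding, and the sample bytes stand in for `web_response.content` so the example runs without a network call.

```python
import csv
import gzip


def rows_from_gzipped_csv_bytes(gz_bytes, encoding="utf-8"):
    """Decompress gzipped CSV bytes in memory and return the parsed rows."""
    # gzip.decompress returns bytes; decode so csv.reader receives str lines.
    text = gzip.decompress(gz_bytes).decode(encoding)
    # A list of lines is a valid iterable for csv.reader.
    return list(csv.reader(text.splitlines()))


# Stand-in for web_response.content (assumption: built locally for the demo).
sample = gzip.compress(b"name,count\nalpha,1\nbeta,2\n")
rows = rows_from_gzipped_csv_bytes(sample)
print(rows)
```

This loads the whole file into memory, and splitting on line breaks may not faithfully preserve quoted fields that contain embedded newlines, so it is best suited to small, simple files.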

Torxed
  • I see, thank you. It throws an error w/ the bytes having a 2nd argument: File "csv_processor.py", line 53, in f = io.BytesIO(bytes(csv_gz_file, 'UTF-8')) TypeError: str() takes at most 1 argument (2 given) – Tim Nuwin Jun 08 '16 at 14:17
  • @TimNuwin I'm using Python3 (as you probably should if you don't know any reason not to) - There you need to define the encoding `bytes(str, enc)` it should be using. Simply remove `, 'utf-8'` from the `bytes()` function in this case. – Torxed Jun 08 '16 at 14:22
  • Yeah.. I am unfortunately running 2.7. This is what happens if I remove that UTF-8 encoding argument Traceback (most recent call last): File "csv_processor", line 56, in csv_content = fh.read() ... raise IOError, 'Not a gzipped file' IOError: Not a gzipped file --- I am able to download the file directly from the link and decompress it properly though. – Tim Nuwin Jun 08 '16 at 14:25
  • @TimNuwin perhaps a `print([csv_gz_file])` will give you a clue as to what's wrong with the filecontent. I'm guessing `csv_gz_file` isn't in fact a gzip file but a list or the data includes the web header (header+data)? `import requests` is not a default module of Python 2.7.11 at least so I wouldn't know what it returns : ) – Torxed Jun 08 '16 at 14:27
  • I see. It prints this: [] – Tim Nuwin Jun 08 '16 at 14:29
  • Sorry I'm a noob with python. I was actually using get to retrieve the file: and here's the top of python file: from requests import get, post, Timeout – Tim Nuwin Jun 08 '16 at 14:30
  • @TimNuwin Good, my answer will hold up assuming you fix the initial problem. That is, however, a completely different topic, and I suggest you take a look at others who have solved that bit for you. For instance https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python or https://stackoverflow.com/questions/19602931/basic-http-file-downloading-and-saving-to-disk-in-python. – Torxed Jun 08 '16 at 14:37
  • @TimNuwin Also welcome to Python and sorta StackOverflow? : ) – Torxed Jun 08 '16 at 14:37
  • `bytes(csv_gz_file, 'UTF-8')` should not even function, at least if *csv_gz_file* is a [`Response`](http://docs.python-requests.org/en/v0.10.6/api/#requests.Response) object. – Ilja Everilä Jun 08 '16 at 15:26
  • @IljaEverilä This is well pointed out in these comments and as mentioned - he didn't specify what lib used to `requests.get` - it could have been a custom class. – Torxed Jun 08 '16 at 15:31
  • Could've, but requests is *quite* popular and **your** answer clearly is using it, so why the broken code? Gzipped data is nothing like utf-8 encoded text, so don't treat it as such. Use the *content* attribute of a response instead, which holds the binary content. `csv_fh = io.StringIO(csv_content)` will also blow up btw, since *csv_content* is bytes read from a gzip file. – Ilja Everilä Jun 08 '16 at 15:43
  • @IljaEverilä The question was `How do I uncompress that file and convert it to a csv object?` and not how to use the `requests` lib. I get your point, and as I pointed out in the comments, I have no interest in or knowledge of how the requests module works. So assuming `csv_gz_file` is actually data containing the CSV file, my answer would hold up. I simply used the variable name and function provided by the user. I'll also point out that my code **doesn't** use `requests`; that's an assumption on your part, just as we both assume now that the OP uses it because it's popular. – Torxed Jun 08 '16 at 15:48
  • @IljaEverilä However, I'll update my answer based on your feedback and let's **hope** the user actually uses the "popular" framework `requests`. – Torxed Jun 08 '16 at 15:49
  • Fair enough, OP might not be using [requests](http://docs.python-requests.org/en/master/), though the calling convention is spot on and the error he has is the exact error you'd get in python 2, if you pass a [`Response`](http://docs.python-requests.org/en/v0.10.6/api/#requests.Response) as is to `csv.reader`... – Ilja Everilä Jun 08 '16 at 15:55
  • @IljaEverilä I totally agree. But the number of times I've been downvoted into oblivion or bashed publicly for assuming things is staggeringly overwhelming compared to the times I've kept a post strictly on topic, let these assumptions go, and instead pointed out "assuming X contains Y". It's a matter of how the guidelines for this community should be updated/enforced/exerted. I just go by statistics in this case. If the down vote in this case is yours and you still feel that my updated answer doesn't meet your requirements, please let me know and I'll improve it further. – Torxed Jun 08 '16 at 15:58
  • It's true that the question could've used editing from OP since it does not explicitly define where `get` came from. The down vote is from me, as your answer still mixes bytes and text: a `GzipFile` returns bytes when read and a `StringIO` expects text. Either the bytes have to be decoded or the `GzipFile` could be wrapped with `io.TextIOWrapper` that decodes it while reading. – Ilja Everilä Jun 08 '16 at 16:05
  • @IljaEverilä Seeing as `csv.reader` can handle the `bytes` data in Python2 without a type conversion (essentially a string anyway) I removed the `StringIO` line. – Torxed Jun 08 '16 at 16:14
  • As @IljaEverilä suggested, I had to use `reader = csv.reader(io.TextIOWrapper(fh, 'utf8'))` to avoid getting a `_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)` error. – juniper- Sep 24 '18 at 15:58