Quick Summary:
I want to take a large txt.gz file (>20 GB while compressed) that is hosted on a website, "open" it with gzip, and then run itertools.islice over it to slowly extract the lines. I don't believe that gzip can handle this natively.
The problem:
Libraries like urllib appear to download the entire binary data stream at once. The scripts I've found that use urllib or requests stream to a local file or a variable and only then decompress the data to read the text. I need to do this on the fly because the data set I'm working with is too large to hold all at once. Also, since I want to iterate across lines of text, setting chunk sizes in bytes won't always give me a clean line break in my data. My data will always be newline-delimited.
Example local code (no URL capability):
This works beautifully on disk with the following code.
from itertools import islice
import gzip
# Gzip file open call
datafile = gzip.open("/home/shrout/Documents/line_numbers.txt.gz")
chunk_size = 2
while True:
    data_chunk = list(islice(datafile, chunk_size))
    if not data_chunk:
        break
    print(data_chunk)
datafile.close()
Example output from this script:
shrout@ubuntu:~/Documents$ python3 itertools_test.py
[b'line 1\n', b'line 2\n']
[b'line 3\n', b'line 4\n']
[b'line 5\n', b'line 6\n']
[b'line 7\n', b'line 8\n']
[b'line 9\n', b'line 10\n']
[b'line 11\n', b'line 12\n']
[b'line 13\n', b'line 14\n']
[b'line 15\n', b'line 16\n']
[b'line 17\n', b'line 18\n']
[b'line 19\n', b'line 20\n']
Related Q&As on Stack:
- Read a gzip file from a url with zlib in Python 2.7
- Stream a large file from URL straight into a gzip file
My problem with these Q&As is that they never decompress and read the data while they are handling it; the data stays compressed as it is written into a new local file or a variable in the script. My data set is too large to fit in memory all at once, and writing the original file to disk before reading it (again) would be a waste of time.
I can already use my example code to perform my tasks "locally" on a VM, but I'm being moved over to object storage (minio) and Docker containers. I need to find a way to create a file handle that gzip.open (or something like it) can use directly. I just need a "handle" that is based on a URL. That may be a tall order, but I figured this is the right place to ask... and I'm still learning a bit about this, so perhaps I've overlooked something simple. :)
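In case it helps frame what I mean by a "handle": gzip.GzipFile can wrap any object that has a read() method, so the raw response stream from requests might work as that file object. This is only a sketch, not tested against minio, and the URL is just a placeholder; the islice loop from my local code stays almost unchanged:
import gzip
from itertools import islice

import requests

target_url = "http://127.0.0.1:9000/test-bucket/big_data_file.json.gz"  # placeholder URL

# stream=True keeps requests from reading the whole body up front;
# response.raw is a file-like object that reads straight off the socket.
with requests.get(target_url, stream=True) as response:
    response.raise_for_status()
    # Leave the gzip bytes untouched so GzipFile does the decompression itself
    response.raw.decode_content = False
    datafile = gzip.GzipFile(fileobj=response.raw)
    chunk_size = 2
    while True:
        data_chunk = list(islice(datafile, chunk_size))
        if not data_chunk:
            break
        print(data_chunk)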
-----Partial Solution-----
I'm working on this and found some excellent posts once I started searching differently. I now have code that streams the gzipped file in chunks that can be decompressed, though breaking the data back into line-delimited strings will carry an additional processing cost (see the line-splitting sketch after the new code below). I'm not thrilled about that, but I'm not sure what I'll be able to do about it.
New Code:
import requests
import zlib

target_url = "http://127.0.0.1:9000/test-bucket/big_data_file.json.gz"

# Using zlib.MAX_WBITS|32 apparently forces zlib to detect the appropriate header for the data
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)

# Stream this file in as a request - pull the content in just a little at a time
with requests.get(target_url, stream=True) as remote_file:
    # Chunk size can be adjusted to test performance
    for chunk in remote_file.iter_content(chunk_size=8192):
        # Decompress the current chunk
        decompressed_chunk = decompressor.decompress(chunk)
        print(decompressed_chunk)
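To get line-delimited strings back out of those decompressed chunks, the best I've come up with so far (my own buffering sketch, not taken from the linked answers) is to carry over whatever follows the last newline in each chunk and prepend it to the next one:
import requests
import zlib

target_url = "http://127.0.0.1:9000/test-bucket/big_data_file.json.gz"
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)

buffer = b""  # holds any partial line left over from the previous chunk
with requests.get(target_url, stream=True) as remote_file:
    for chunk in remote_file.iter_content(chunk_size=8192):
        buffer += decompressor.decompress(chunk)
        # Everything before the final \n is complete lines; the tail is kept for the next pass
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            print(line)
if buffer:
    print(buffer)  # last line, in case the file doesn't end with a newline
Memory use should stay bounded by one decompressed chunk plus the longest line, rather than the whole file.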
Helpful answers:
Will update with a final solution once I get it. Pretty sure this will be slow as molasses when compared to the local drive access I used to have!