Quick Summary:

I want to take a large .txt.gz file (>20 GB while compressed) that is hosted on a website, "open" it with gzip, and then run itertools.islice over it to slowly extract the lines. I don't believe that gzip can handle a remote URL natively.

The problem:

Libraries like urllib appear to download the entire binary data stream at once. The scripts I've found that use urllib or requests write the stream to a local file or variable and then decompress it to read the text. I need to do this on the fly because the data set I am working with is too large to fit in memory. Also, since I want to iterate across lines of text, setting chunk sizes based on bytes won't always give me a clean line break in my data. My data will always be newline-delimited.

Example local code: (No url capability)

This works beautifully on disk with the following code.

from itertools import islice
import gzip

#Open the gzipped file; iterating over the handle yields one line (as bytes) at a time
datafile = gzip.open("/home/shrout/Documents/line_numbers.txt.gz")

chunk_size = 2

while True:
    data_chunk = list(islice(datafile, chunk_size))
    if not data_chunk:
        break
    print(data_chunk)

datafile.close()

Example output from this script:

shrout@ubuntu:~/Documents$ python3 itertools_test.py 
[b'line 1\n', b'line 2\n']
[b'line 3\n', b'line 4\n']
[b'line 5\n', b'line 6\n']
[b'line 7\n', b'line 8\n']
[b'line 9\n', b'line 10\n']
[b'line 11\n', b'line 12\n']
[b'line 13\n', b'line 14\n']
[b'line 15\n', b'line 16\n']
[b'line 17\n', b'line 18\n']
[b'line 19\n', b'line 20\n']

Related Q&A's on Stack:

My problem with these Q&A's is that they never try to decompress and read the data as they are handling it. The data stays in binary format while it is written into a new local file or into a variable in the script. My data set is too large to fit in memory all at once, and writing the original file to disk before reading it (again) would be a waste of time.

I can already use my example code to perform my tasks "locally" on a VM, but I'm being forced over to object storage (MinIO) and Docker containers. I need to find a way to basically create a file handle that gzip.open (or something like it) can use directly. I just need a "handle" that is based on a URL. That may be a tall order but I figured this is the right place to ask... And I'm still learning a bit about this too, so perhaps I've overlooked something simple. :)
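
Here is a rough sketch of the kind of "handle" I mean, assuming the bucket allows anonymous reads and using the same placeholder URL as my later code. With stream=True, requests exposes the raw response as a file-like object, and gzip.GzipFile accepts any file-like object through its fileobj argument, so it may be possible to wrap the response directly and keep the islice loop unchanged (I haven't verified this against MinIO or a 20+ GB object yet):

import gzip
from itertools import islice

import requests

#Placeholder URL - same MinIO-style endpoint as the code further down
target_url = "http://127.0.0.1:9000/test-bucket/big_data_file.json.gz"

#stream=True stops requests from pulling the whole body up front;
#remote_file.raw is a file-like object that gzip.GzipFile can wrap via fileobj
with requests.get(target_url, stream=True) as remote_file:
    remote_file.raise_for_status()
    with gzip.GzipFile(fileobj=remote_file.raw) as datafile:
        chunk_size = 2
        while True:
            data_chunk = list(islice(datafile, chunk_size))
            if not data_chunk:
                break
            print(data_chunk)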

-----Partial Solution-------

I'm working on this and found some excellent posts when I started searching differently. I now have code that streams the gzipped file in chunks that can be decompressed, though breaking the data back into newline-delimited strings is going to carry additional processing cost. I'm not thrilled about that, but I'm not sure what I'll be able to do about it; one way to wrap it all up as a generator is sketched after the code below.

New Code:

import requests
import zlib

target_url = "http://127.0.0.1:9000/test-bucket/big_data_file.json.gz"

#Using zlib.MAX_WBITS|32 tells zlib to auto-detect the appropriate header (gzip or zlib) for the data
decompressor = zlib.decompressobj(zlib.MAX_WBITS|32)
#Stream this file in as a request - pull the content in just a little at a time
with requests.get(target_url, stream=True) as remote_file:
    #Chunk size can be adjusted to test performance
    for chunk in remote_file.iter_content(chunk_size=8192):
        #Decompress the current chunk
        decompressed_chunk = decompressor.decompress(chunk)
        print(decompressed_chunk)
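
Here is a rough sketch (untested on the full data set) of how that chunked decompression could be wrapped in a generator that only yields complete lines. It splits on b"\n" since my data is newline-delimited, keeps the trailing fragment in bytes until the next chunk arrives, and flushes the decompressor at the end. The stream_lines name and the chunk_size default are just placeholders:

import requests
import zlib

def stream_lines(url, chunk_size=8192):
    #Yield complete text lines from a remote gzip file without holding it all in memory
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    leftover = b""
    with requests.get(url, stream=True) as remote_file:
        remote_file.raise_for_status()
        for chunk in remote_file.iter_content(chunk_size=chunk_size):
            leftover += decompressor.decompress(chunk)
            lines = leftover.split(b"\n")
            #The last element may be a partial line; hold it back for the next chunk
            leftover = lines.pop()
            for line in lines:
                yield line.decode()
        #Flush whatever is still buffered in the decompressor and emit the tail
        leftover += decompressor.flush()
        for line in leftover.splitlines():
            yield line.decode()

#Example use (placeholder URL):
#for line in stream_lines("http://127.0.0.1:9000/test-bucket/big_data_file.json.gz"):
#    print(line)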

Helpful answers:

Will update with a final solution once I get it. Pretty sure this will be slow as molasses when compared to the local drive access I used to have!

  • Please share your attempted code that streams from a URL. – blhsing Aug 23 '21 at 22:14
  • @blhsing I can put up what I did with `urllib` but the problem with it is that it downloads the file in its entirety, which I can't afford to do. – Shrout1 Aug 24 '21 at 13:03
  • @blhsing I now have a partial solution. What remains is to iterate across the lines in the chunks and find a way to stitch together broken lines in a manner that isn't too computationally expensive. – Shrout1 Aug 24 '21 at 15:39

1 Answer

This code will stream the target file in chunks, decompress it using zlib (so gzip format or something similar), and then print out the lines. I haven't tested this exhaustively on the final chunk of a file, so I may come back and revise. For the moment, though, this accomplishes what I was looking for!

import requests
import zlib
from itertools import islice

#Be sure to have a MinIO bucket that has either public or download capabilities in order to use this script w/ MinIO
target_url = "http://127.0.0.1:9000/test-bucket/big_data_file.json.gz"

#Using zlib.MAX_WBITS|32 tells zlib to auto-detect the appropriate header (gzip or zlib) for the data
decompressor = zlib.decompressobj(zlib.MAX_WBITS|32)
#Stream this file in as a request - pull the content in just a little at a time
with requests.get(target_url, stream=True) as remote_file:
    last_line = ""  #start this blank
    #Chunk size can be adjusted to test performance
    for chunk in remote_file.iter_content(chunk_size=1024):
        #Decompress the current chunk
        decompressed_chunk = decompressor.decompress(chunk)
        #These characters are in "byte" format and need to be decoded to utf-8
        decompressed_chunk = decompressed_chunk.decode()
        #Prepend the "last line" to pick up any fragment from the previous chunk - it is blank the first time around
        #This basically sticks line fragments from the last chunk onto the front of the current chunk.
        decompressed_chunk = last_line + decompressed_chunk
        #Run a split here; this is likely a costly step...
        split_chunk = decompressed_chunk.splitlines()
        #Pop the last line off the chunk since it isn't likely to be complete
        #We'll add it to the front of the next chunk
        last_line = split_chunk.pop() if split_chunk else ""
        #We'll use islice for quick iteration across the data that's been pulled from the file
        for line in islice(split_chunk, 0, len(split_chunk)):
            #Data can be processed here, line by line.
            print(line)
    #Flush anything still buffered in the decompressor and print the final fragment, if any
    last_line += decompressor.flush().decode()
    if last_line:
        print(last_line)
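
One caveat I haven't addressed above: the .decode() call assumes a chunk never ends in the middle of a multi-byte UTF-8 character. If the data could ever contain multi-byte characters, an incremental decoder from the standard library codecs module buffers the split character until the rest of its bytes arrive. A tiny standalone sketch of the idea (the sample string is made up):

import codecs

#An incremental decoder holds partial byte sequences until the remaining bytes arrive
decoder = codecs.getincrementaldecoder("utf-8")()

#Simulate a multi-byte character ("é") being split across two chunks
raw = "héllo\n".encode("utf-8")
part1, part2 = raw[:2], raw[2:]

text = decoder.decode(part1) + decoder.decode(part2)
print(text)  #Prints "héllo" with no UnicodeDecodeError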