Using Python 3.x, I need to extract JSON objects from a large file (>5 GB), read as a stream. The file is stored on S3, and I don't want to load the entire file into memory for processing. Therefore I read the data in chunks with amt=100000 (or some other chunk size).

The data is in this format:

{
object-content
}{
object-content
}{
object-content
}

...and so on.

To manage this, I have tried a few things, but the only working solution I have is to read the file chunk by chunk and look for "}". For every "}", I try to parse the moving window of text between the start and stop indexes with json.loads(). If that fails, I pass and move on to the next "}". If it succeeds, I yield the object and update the indexes.
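To illustrate the idea on a small scale, here is a minimal sketch that tries json.loads() on the text up to every "}". The sample string and the helper name try_each_brace are made up for illustration and are not part of my real data or code.

```python
import json

# Hypothetical sample: two concatenated objects, the first one nested.
SAMPLE = '{"a": {"b": 1}}{"c": 2}'


def try_each_brace(text):
    """Return (end_index, parsed_or_None) for every "}" in text.

    Each attempt parses text[:end_index], i.e. always from position 0,
    so only a "}" that closes exactly one complete object succeeds.
    """
    results = []
    for i, ch in enumerate(text):
        if ch != "}":
            continue
        end = i + 1
        try:
            results.append((end, json.loads(text[:end])))
        except json.JSONDecodeError:
            # Either the object is still open, or there is extra data after it.
            results.append((end, None))
    return results
```

Only the second "}" yields an object here: the first one leaves the outer object open, and the last one fails because the window still starts at the first object. That is why the start index has to move forward after every success.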

import codecs
import json
import re

def streamS3File(s3objGet):

    # Decode incrementally so that a multi-byte UTF-8 character split
    # across two reads is not corrupted
    decoder = codecs.getincrementaldecoder("utf-8")()
    chunk = ""
    indexStart = 0 # starting point of the moving window of text where a JSON object starts

    while True:
        # Get a new chunk of data
        rawChunk = s3objGet["Body"].read(amt=100000)
        # If rawChunk is empty, we are at the end of the file
        if len(rawChunk) == 0:
            # End the generator with return; raising StopIteration inside a
            # generator becomes a RuntimeError in Python 3.7+ (PEP 479)
            return
        # Add to the leftover from the last chunk
        chunk = chunk + decoder.decode(rawChunk)

        # Look for "}". For every "}", try to convert the part of the chunk
        # to JSON. If it fails, look for the next "}".
        for m in re.finditer(r"\}", chunk):
            indexStop = m.end() # stopping point of the moving window
            try:
                obj = json.loads(chunk[indexStart:indexStop])
            except json.JSONDecodeError:
                continue
            yield obj
            indexStart = indexStop
        # Remove the part of the chunk already processed and returned as objects
        chunk = chunk[indexStart:]
        # Reset the index
        indexStart = 0

for t in streamS3File(s3ReadObj):
    # t is the JSON object found
    # do something with it here
    pass

I would like input on other ways to accomplish this: finding JSON objects in a stream of text and extracting them as they pass by.
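One direction that might be worth comparing is json.JSONDecoder.raw_decode(), which parses one JSON value starting at a given index and returns the value together with the index where it ended, so there is no need to search for "}" at all. The following is only a sketch under assumptions: the function name iter_json_objects and the read_chunk callable are hypothetical, and it assumes every top-level value is an object or array (a bare number split across two chunks could be parsed too early).

```python
import codecs
import json


def iter_json_objects(read_chunk, chunk_size=100000):
    """Yield JSON values from a stream of concatenated JSON documents.

    read_chunk is a hypothetical callable that takes a byte count and
    returns bytes, with b"" signalling the end of the stream.
    """
    json_decoder = json.JSONDecoder()
    # Incremental decoding keeps multi-byte characters intact across chunks.
    utf8_decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    while True:
        raw = read_chunk(chunk_size)
        at_eof = not raw
        buffer += utf8_decoder.decode(raw, final=at_eof)
        pos = 0
        while True:
            # Skip whitespace between documents; raw_decode does not.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos >= len(buffer):
                break
            try:
                obj, pos = json_decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                # Most likely an incomplete object; wait for more data.
                break
            yield obj
        # Keep only the unparsed tail for the next round.
        buffer = buffer[pos:]
        if at_eof:
            if buffer.strip():
                raise ValueError("stream ended inside a JSON document")
            return
```

With the S3 response from above this could presumably be called as iter_json_objects(lambda n: s3objGet["Body"].read(amt=n)). One trade-off to keep in mind: an object much larger than the chunk size is re-parsed from its start on every new chunk until it is complete.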

martineau
Jørgen Frøland
