Using Python 3.x, I need to extract JSON objects from a large file (>5GB) stored on S3, read as a stream. I don't want to load the entire file into memory for processing, so I read chunks of data with amt=100000 (or some other chunk size).
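For context, this is roughly how I open the stream with boto3 (the bucket and key names here are placeholders):

import boto3

# Placeholder bucket/key; the real object is the >5GB file of concatenated JSON
s3 = boto3.client("s3")
s3ReadObj = s3.get_object(Bucket="my-bucket", Key="path/to/data.json")
# s3ReadObj["Body"] is a StreamingBody; .read(amt=...) returns up to amt bytes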
The data is in this format:
{
object-content
}{
object-content
}{
object-content
}
...and so on.
To manage this, I have tried a few things, but the only working solution I have is to read the chunk piece by piece and look for "}". For every "}" I try to convert the moving window of text (between the start and stop indexes) to JSON with json.loads(). If it fails, I pass and move to the next "}". If it succeeds, I yield the object and update the indexes.
import codecs
import json
import re

def streamS3File(s3objGet):
    # Incremental decoder so a multi-byte UTF-8 character split across
    # two chunks does not raise a UnicodeDecodeError
    utf8Decoder = codecs.getincrementaldecoder("utf-8")()
    chunk = ""
    indexStart = 0  # starting point of the moving window of text where a JSON object begins
    while True:
        # Get a new chunk of data
        newChunk = utf8Decoder.decode(s3objGet["Body"].read(amt=100000))
        # If newChunk is empty, we are at the end of the file
        if len(newChunk) == 0:
            return  # end the generator; raising StopIteration is an error in Python 3.7+ (PEP 479)
        # Add to the leftover from the last chunk
        chunk = chunk + newChunk
        # Look for "}". For every "}", try to convert the window up to and
        # including it to JSON. If parsing fails, look for the next "}".
        for m in re.finditer("}", chunk):
            try:
                obj = json.loads(chunk[indexStart:m.end()])
            except json.JSONDecodeError:
                continue
            yield obj
            indexStart = m.end()
        # Remove the part of the chunk already processed and yielded
        chunk = chunk[indexStart:]
        # Reset index
        indexStart = 0
for t in streamS3File(s3ReadObj):
    # t is the JSON object found
    # do something with it here
I would like input on other ways to accomplish this: finding JSON objects in a stream of text and extracting them as they pass by.
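One direction I have considered but not implemented is json.JSONDecoder.raw_decode, which parses a single object from the front of a string and reports the index just past it, so there is no guessing at every "}". A rough sketch, untested against the real file:

import codecs
import json

def streamS3FileRawDecode(s3objGet):
    jsonDecoder = json.JSONDecoder()
    utf8Decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    while True:
        newChunk = utf8Decoder.decode(s3objGet["Body"].read(amt=100000))
        if len(newChunk) == 0:
            # End of file; anything left in buffer would be an incomplete object
            return
        buffer += newChunk
        while buffer:
            try:
                # raw_decode returns (object, index just past the object)
                obj, end = jsonDecoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # object incomplete so far; read more data first
            yield obj
            buffer = buffer[end:].lstrip()

Would something like this be more robust, or is there a better-established pattern?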