I have JSON files on S3 like this:

{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}

The structure is not an array; it is concatenated JSON objects without any newlines between them. There are thousands of files, and I need only a couple of fields from each record. How can I process them fast?

I will use this on AWS Lambda. The code I am thinking of is something like this:

import json

data_chunk = data_file.read()  # data_file: file-like handle for one S3 object, read as str
recs = data_chunk.split('}')
json_recs = []
# From this point onwards it becomes inefficient, since I have to iterate over every record
for rec in recs:
    if not rec.strip():  # skip the empty fragment left after the final '}'
        continue
    json_recs.append(json.loads(rec + '}'))
    # Extract individual fields here
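
For what it's worth, a sketch of a more robust per-record variant is below, using json.JSONDecoder.raw_decode so it does not break if a value ever contains '}' (it is still pure-Python iteration, so I expect it to be similarly slow; the field names are just the ones from the sample above):

import json

def iter_concatenated_json(text):
    """Yield each JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    idx, end = 0, len(text)
    while idx < end:
        # Skip any whitespace between objects.
        while idx < end and text[idx].isspace():
            idx += 1
        if idx >= end:
            break
        # raw_decode returns the parsed object and the index just past it.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

# e.g. pull only the needed fields (data_chunk must be a str; decode it first if read as bytes):
# wanted = [(rec['key1'], rec['key2']) for rec in iter_concatenated_json(data_chunk)]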

How can this be improved? Will using a Pandas DataFrame help? Individual files are small, about 128 MB each.

user 923227
  • What kind of *process* are you trying to do? The first few services I can think of are EMR, Lambda and Glue. Are you set on one of them, or are you still deciding which service is suitable? I could guess it is Lambda since you mention Python here, but it's better if you say so explicitly. – vahdet Mar 01 '19 at 08:21
  • Is there a line delimiter, or is all the JSON concatenated on a single line? – omuthu Mar 01 '19 at 09:57

1 Answer

S3 Select supports this JSON Lines structure. You can query it with a SQL-like language. It's fast and cheap.
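
For reference, a minimal sketch of querying a single object this way with boto3 (the bucket, key, and selected field names are placeholders; Type='LINES' assumes the newline-delimited layout this answer describes):

import boto3

s3 = boto3.client('s3')

resp = s3.select_object_content(
    Bucket='my-bucket',                    # placeholder bucket
    Key='some/object/path/file',           # placeholder key
    ExpressionType='SQL',
    Expression="SELECT s.key1, s.key2 FROM s3object s",
    InputSerialization={'JSON': {'Type': 'LINES'}},
    OutputSerialization={'JSON': {}},
)

# The response 'Payload' is an event stream; 'Records' events carry the matching rows.
for event in resp['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))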

Milan Cermak
  • Sorry, I'm not sure what you mean by "a prefix" in this context. Can you elaborate? – Milan Cermak Mar 01 '19 at 20:54
  • Here the code goes like: `s3.select_object_content( Bucket=src_bucket, Key=src_key,` where `src_key` is an exact object like `/some/object/path/file`; a `prefix` would be `/some/object/path/`, where there are multiple objects under that path. Tried with a `Prefix` in place of `Key`; it did not work! (A per-object loop over a prefix is sketched after these comments.) – user 923227 Mar 01 '19 at 20:58
  • 1
    No, but you can use Athena to do that https://stackoverflow.com/questions/51312541/does-aws-s3-select-work-with-multiple-files – Milan Cermak Mar 01 '19 at 21:04
  • 1
    This is a much cleaner code to use however w.r.t to performance processing 6033 files took nearly 22/23 mins similar to manually processing 20 mins. – user 923227 Mar 01 '19 at 22:42
  • I guess it's not applicable to your current situation, but in case you design something similar in the future - having the files compressed in storage helps a lot with query performance. – Milan Cermak Mar 01 '19 at 23:11
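
For completeness: since select_object_content takes a single Key, covering a whole prefix means listing the objects and running one select per object. A minimal sketch of that loop, assuming placeholder bucket, prefix, and field names, with process() standing in for whatever is done with each batch of returned rows:

import boto3

s3 = boto3.client('s3')

def process(records_bytes):
    # Placeholder handler: parse or collect the returned rows as needed.
    print(records_bytes.decode('utf-8'))

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='some/object/path/'):
    for obj in page.get('Contents', []):
        resp = s3.select_object_content(
            Bucket='my-bucket',
            Key=obj['Key'],
            ExpressionType='SQL',
            Expression="SELECT s.key1, s.key2 FROM s3object s",
            InputSerialization={'JSON': {'Type': 'LINES'}},
            OutputSerialization={'JSON': {}},
        )
        for event in resp['Payload']:
            if 'Records' in event:
                process(event['Records']['Payload'])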