
Say I have a massive list of dictionaries (2 million of them). I essentially need to json.dumps() each dictionary and concatenate everything into one massive string (to put in the body of a request to AWS OpenSearch). So far I have this:

import json
json_data = ''
action = {'index': {}}
for item in data:
    # each document is preceded by an action line (newline-delimited JSON)
    json_data += f'{json.dumps(action)}\n'
    json_data += f'{json.dumps(item)}\n'

where data is the large list of dictionaries. This takes on average between 0.9 and 1 second. Is there a more efficient way to do this?

Other SO questions conclude that if this were a single string concatenation done once, c = a + b would be the fastest way; however, I have to keep appending to what in this case would be c. I have to repeat this operation many times, so speeding it up would be immensely helpful. Is there a way to speed up this function, and if so, what would those optimizations look like?

rodrigocf
  • Why not join the dictionaries then dump them? – MSH Jun 20 '22 at 15:24
  • `Thoughts?` → 1. If you're already using Python, just use the `opensearchpy` library and try bulk imports into OpenSearch without converting to a JSON-like string (see the sketch after this comment thread). 2. Will OpenSearch accept 2-million-record payloads? What is the size of the whole data? 3. The multiprocessing library may be used to speed up the data creation process. – rzlvmp Jun 20 '22 at 15:39
  • Please clarify... How many items in *data* are you currently able to process in under 1 second? – DarkKnight Jun 20 '22 at 15:43
  • @rzlvmp 1. Good call, will give this a try. 2. Lol, yes, each iteration is 2MM records, but the real data size is about 600MM records. 3. Yes, I'm already doing threads of 2MM records out of the 600MM, and I am still experimenting with different thread counts and payload sizes – rodrigocf Jun 20 '22 at 15:46
  • @AlbertWinestein `data` is a list of 2MM dictionaries, which are processed in one second – rodrigocf Jun 20 '22 at 15:47
  • @rodrigocf I have constructed a list (*data*) with 2 million small dictionaries. I have executed the code I showed in my answer. It takes 3.6 seconds. I'm running on a 3GHz 10-core Xeon. I'd be very interested to know what hardware you're using to achieve sub-second with a *data* list of that magnitude - especially with your woefully inefficient code – DarkKnight Jun 20 '22 at 15:52
  • @AlbertWinestein I'm running this on AWS Glue nodes: G.2X - 32 GB memory, 8vCPUs, 128 GB of EBS, which is probably why it is so fast. – rodrigocf Jun 20 '22 at 15:59
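
Following up on rzlvmp's suggestion above, here is a minimal sketch of what a bulk load with the opensearchpy helpers could look like. The host, credentials, index name, and chunk size are placeholders, not taken from the question:

from opensearchpy import OpenSearch, helpers

# hypothetical client configuration; host, port, and auth are placeholders
client = OpenSearch(
    hosts=[{'host': 'my-domain.us-east-1.es.amazonaws.com', 'port': 443}],
    http_auth=('user', 'password'),
    use_ssl=True,
)

# wrap each dictionary from the question's `data` list in a bulk action;
# the helper serializes and batches the documents itself, so no manual
# json.dumps/string building is needed
actions = ({'_index': 'my-index', '_source': item} for item in data)

# chunk_size controls how many documents go into each bulk request
helpers.bulk(client, actions, chunk_size=5000)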

2 Answers


Repeated string concatenation is slow. A better approach is to collect the pieces in a list, then serialize and join them once at the end. I don't have access to your data, so I can't test this, but you'd be going for something along the lines of:

import json
json_data = []
action = {'index': {}}
for item in data:
    # collect the raw objects; serialization happens once, in the join below
    json_data.append(action)
    json_data.append(item)
result = '\n'.join([json.dumps(blob) for blob in json_data]) + '\n'  # _bulk expects a trailing newline
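
Since there was no access to the real data to test this, here is a rough, self-contained timing sketch on synthetic data (the record shape and count are made up) that compares the original concatenation loop with the list-and-join version:

import json
import time

# synthetic stand-in for the real payload: 2 million small dictionaries
data = [{'id': i, 'value': f'item-{i}'} for i in range(2_000_000)]
action = {'index': {}}

# original approach: repeated string concatenation
start = time.perf_counter()
json_data = ''
for item in data:
    json_data += f'{json.dumps(action)}\n'
    json_data += f'{json.dumps(item)}\n'
print(f'concatenation: {time.perf_counter() - start:.2f}s')

# suggested approach: collect the pieces in a list, serialize and join once
start = time.perf_counter()
parts = []
for item in data:
    parts.append(action)
    parts.append(item)
result = '\n'.join([json.dumps(blob) for blob in parts]) + '\n'
print(f'list + join:   {time.perf_counter() - start:.2f}s')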
BrokenBenchmark

Variation...

import json
json_data = []
action = json.dumps({'index': {}})  # dumps is only called on this once
for item in data:
    # json_data will be a list of already-serialized strings
    json_data.append(action)
    json_data.append(json.dumps(item))
result = '\n'.join(json_data) + '\n'  # the _bulk API expects a trailing newline
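
For completeness, a minimal sketch of how the assembled result string might be posted to the bulk endpoint; the domain URL and index name are placeholders, and AWS request signing/authentication is not shown:

import requests  # assumption: plain HTTP client; AWS request signing not shown

# hypothetical endpoint; the _bulk API expects newline-delimited JSON and a
# body that ends with a newline character
url = 'https://my-domain.us-east-1.es.amazonaws.com/my-index/_bulk'
body = result if result.endswith('\n') else result + '\n'

response = requests.post(
    url,
    data=body.encode('utf-8'),
    headers={'Content-Type': 'application/x-ndjson'},
)
response.raise_for_status()
print(response.json()['errors'])  # True if any individual action failed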
DarkKnight