This question is similar to (but I believe different from) Efficient way to shuffle lists within a list in Python
My input: a JSON Lines file (~10M lines). Each line is the JSON representation of a dictionary with a primary key whose value is guaranteed to be unique across lines, another key whose value is a list of strings (between 1 and 500 items; mean ~15) that I need to shuffle, and other key/value pairs, e.g.,
{'primary_key': 'some_unique_value', 'the_list': <list (of strings)>, 'foo': <string>, 'bar': <int>}
My need: a Python dictionary whose keys are the ~10M unique primary-key values from the input file, and whose values are dictionaries containing the shuffled lists along with the other key/value pairs, e.g.,
{'some_unique_value': {'the_list': <shuffled version of list (of strings)>, 'foo': <string>, 'bar': <int>}}
Naive approach:
import json
import random

mydict = {}
with smartopen(myfile) as inf:
    for line in inf:
        # parse one JSON line, pull out the primary key, shuffle in place
        j_dict = json.loads(line.rstrip())
        key = j_dict.pop('primary_key')
        mydict[key] = j_dict
        random.shuffle(mydict[key]['the_list'])
smartopen is streaming the file in through a decompression pipe.
Based on the question linked above, it looks like I might be better off with np.random.shuffle(). But if I'm reading that thread correctly, the poster also got a speedup by using a list comprehension rather than a for loop. Am I likely to see a benefit from figuring out how to write this as a dictionary comprehension, given that I'm probably limited by the I/O speed of streaming the decompressed data from disk?
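The closest I've gotten to a dictionary comprehension version is the untested sketch below, reusing smartopen and myfile from above. shuffled is a hypothetical helper I'd have to write myself, since random.shuffle and np.random.shuffle both work in place and return None:

import json
import numpy as np

def shuffled(j_dict):
    # hypothetical helper: shuffle the list in place, then return the dict
    # so it can serve as the value expression of the comprehension
    np.random.shuffle(j_dict['the_list'])
    return j_dict

with smartopen(myfile) as inf:
    mydict = {
        j_dict.pop('primary_key'): shuffled(j_dict)
        for j_dict in map(json.loads, inf)
    }

I don't know whether this buys anything over the explicit loop, which is part of what I'm asking.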
I have an HPC computing array at my disposal, so if there's a good way to parallelize the shuffle operations, I have the hardware to support that. I don't have any experience parallelizing Python code (yet!), but this seems like it could be an embarrassingly parallel problem if I could figure out how to set it up, since each line can be parsed and shuffled independently. (In that case, maybe I do need to rewrite it as a dictionary comprehension?)
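Concretely, the kind of thing I have in mind (untested) is a single-node process pool where each worker parses and shuffles one line and the parent process assembles the dictionary. _shuffle_line, the pool size, and the chunksize are names and numbers I made up for illustration:

import json
import random
from multiprocessing import Pool

def _shuffle_line(line):
    # runs in a worker process: parse one line and shuffle its list in place
    j_dict = json.loads(line)
    key = j_dict.pop('primary_key')
    random.shuffle(j_dict['the_list'])
    return key, j_dict

if __name__ == '__main__':
    mydict = {}
    # initializer=random.seed reseeds each worker so forked workers don't
    # all inherit the same RNG state from the parent
    with smartopen(myfile) as inf, Pool(processes=8, initializer=random.seed) as pool:
        # a large chunksize keeps per-task IPC overhead down for ~10M short lines
        for key, j_dict in pool.imap_unordered(_shuffle_line, inf, chunksize=1000):
            mydict[key] = j_dict

Is something along these lines worth setting up, or will the single decompression stream in the parent process still be the bottleneck?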
I haven't benchmarked the different options, since I'm not sure the dictionary comprehension above is even the right way to express what I need, and I'm not sure what options exist for parallelizing the creation of a dictionary (all the examples I can find use lists).