This seems like a question that should have been asked before, but I couldn't find it, so here it goes.
The data:
- a master list (length ~16,000,000) of sublists (each up to 500 items long) of str
The aim:
- to shuffle each of the sublists within the master list efficiently
I have attempted a straight for-loop, a list comprehension, pandas Series.apply(), pandarallel, and the dask DataFrame .apply() and .map_partitions() methods.
The for-loop takes about 15 minutes. pd.Series.apply(), dask.Series.apply(), and dask.Series.map_partitions() all managed it in just over 6 minutes.
My question is: can I achieve the shuffling faster? Either producing a new copy or shuffling in place is acceptable.
Below are my attempts:
import random

def normal_shuffle(series):
    output = series.tolist()
    for sublist in output:
        # reuse the module-level generator; the original created a new
        # random.Random() per iteration, which adds seeding overhead
        random.shuffle(sublist)
    return output

def shuffle_returned(a_list):
    random.shuffle(a_list)  # shuffles in place; the input list itself is modified
    return a_list

def shuffle_partition(a_partition):
    return a_partition.apply(shuffle_returned)
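As a side note (my own measurement sketch, not from the original post): random.Random() seeds itself from the OS on construction, so creating one per sublist is measurably slower than reusing the module-level random.shuffle. A quick stdlib comparison at toy sizes:

```python
# Toy-sized timing sketch: a fresh random.Random() per sublist re-seeds from
# the OS on every construction, while module-level random.shuffle reuses one
# shared generator.
import random
import timeit

data = [list(range(100)) for _ in range(1_000)]

t_fresh = timeit.timeit(lambda: [random.Random().shuffle(s) for s in data], number=5)
t_shared = timeit.timeit(lambda: [random.shuffle(s) for s in data], number=5)
print(f"fresh Random() per sublist: {t_fresh:.3f}s, shared generator: {t_shared:.3f}s")
```

Over 16M sublists that per-call construction cost adds up, although it will not change the overall picture by itself.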
%time shuffled_for = normal_shuffle(test_series)

%time shuffled_apply = test_series.apply(shuffle_returned)

pandarallel.initialize(progress_bar=False, nb_workers=8)
%time shuffled_parallel_apply = test_series.parallel_apply(shuffle_returned)

test_ddf = ddf.from_pandas(test_series, npartitions=16)
test_ddf = test_ddf.reset_index(drop=True)

shuffled_ddf = test_ddf.apply(shuffle_returned, meta="some_str")
%time shuffled_ddf.persist()

shuffled_by_partition_ddf = test_ddf.map_partitions(shuffle_partition, meta="productId")
%time shuffled_by_partition_ddf.persist()
Now I am trying to use dask distributed to see if I can somehow stagger the model training and the data shuffling, so that the shuffling time overlaps with the training time and the overall time efficiency improves.
I would very much appreciate any feedback or suggestions on how I can make this shuffling operation more efficient.
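For what it's worth, the staggering idea can be sketched with the stdlib alone (illustrative names and toy sizes, not the author's code; with pure-Python shuffling the GIL prevents true overlap in threads, so dask.distributed or a process pool would be the real target):

```python
# Pipelining sketch: shuffle the next chunk in a background worker while the
# current chunk is being "trained" on.
import random
from concurrent.futures import ThreadPoolExecutor

def shuffle_chunk(chunk):
    for sub in chunk:
        random.shuffle(sub)  # in place
    return chunk

def train_on(chunk):
    # stand-in for the real model-training step
    return sum(len(sub) for sub in chunk)

chunks = [[[str(i) for i in range(20)] for _ in range(100)] for _ in range(4)]

trained = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(shuffle_chunk, chunks[0])
    for nxt in chunks[1:]:
        ready = pending.result()                    # wait for the shuffled chunk
        pending = pool.submit(shuffle_chunk, nxt)   # shuffle the next chunk...
        trained.append(train_on(ready))             # ...while training on this one
    trained.append(train_on(pending.result()))
```

The same producer/consumer shape maps onto dask distributed by submitting the shuffle of the next partition before blocking on the training step.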
UPDATE
Having tried some of the suggestions, the following turned out to be the fastest I could achieve, and it is also surprisingly simple!
%time [np.random.shuffle(x) for x in alist]
CPU times: user 23.7 s, sys: 160 ms, total: 23.9 s
Wall time: 23.9 s
Single thread numpy is the way to go here, it seems!
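A self-contained version of that one-liner, at toy sizes (the real master list has ~16M sublists):

```python
# np.random.shuffle operates in place on any mutable sequence, plain Python
# lists included, so no conversion to numpy arrays is needed.
import numpy as np

master = [[str(i) for i in range(500)] for _ in range(1_000)]
for sub in master:          # equivalent to the list-comprehension form above
    np.random.shuffle(sub)  # in-place shuffle of each sublist
```

The plain for-loop form avoids building the throwaway list of None values that the list comprehension produces.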