
This seems like a question that must have been asked before, but I couldn't find it. So here it goes.

The data:
- a master list (length ~16,000,000) of sublists (each up to 500 items long) of str

The aim:
- to shuffle each of the sublists within the master list efficiently.

I have attempted a straight for-loop, a list comprehension, pandas Series.apply(), pandarallel, and the dask dataframe .apply() and .map_partitions() methods.

The for-loop takes about 15 minutes.
pd.Series.apply(), dask Series .apply(), and dask Series .map_partitions() all managed it in just over 6 minutes.

My question is: "can I achieve the shuffling faster?" Either producing a new copy or shuffling in place is acceptable.

Below is my attempt:

import random

def normal_shuffle(series):
    output = series.tolist()
    length = len(output)
    for i in range(length):
        # note: a fresh Random() instance is created for every sublist here
        random.Random().shuffle(output[i])
    return output

def shuffle_returned(a_list):
    new_list = a_list   # binds a new name only; the shuffle still happens in place
    random.shuffle(new_list)
    return new_list

def shuffle_partition(a_partition):
    return a_partition.apply(shuffle_returned)

%time shuffled_for = normal_shuffle(test_series)
%time shuffled_apply = test_series.apply(shuffle_returned)

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=False, nb_workers=8)
%time shuffled_parallel_apply = test_series.parallel_apply(shuffle_returned)

import dask.dataframe as ddf
test_ddf = ddf.from_pandas(test_series, npartitions=16)
test_ddf = test_ddf.reset_index(drop=True)

shuffled_ddf = test_ddf.apply(shuffle_returned, meta="some_str")
%time shuffled_ddf.persist()

shuffled_by_partition_ddf = test_ddf.map_partitions(shuffle_partition, meta="productId")
%time shuffled_by_partition_ddf.persist()

Now I am trying dask distributed, to see whether I can stagger the model training and the data shuffling so that training time and shuffling time overlap and give a better overall time efficiency.
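A rough, untested sketch of that staggering idea, assuming a local dask.distributed cluster; chunks_of_master_list (the master list pre-split into chunks) and the training step are hypothetical placeholders:

from dask.distributed import Client
import random

client = Client()  # local workers; assumes dask.distributed is installed

def shuffle_chunk(chunk):
    # shuffle each sublist of this chunk in place, on a worker
    for sub in chunk:
        random.shuffle(sub)
    return chunk

# submit the shuffles in the background ...
futures = [client.submit(shuffle_chunk, c) for c in chunks_of_master_list]
# ... run a training step here, overlapping with the shuffling ...
shuffled_chunks = client.gather(futures)  # collect once training needs the data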

I would very much appreciate any feedback or suggestions on how I can make this shuffling operation more efficient.


UPDATE

Having tried some of the suggestions, the following turned out to be the fastest I could achieve, and it is surprisingly simple!

%time [np.random.shuffle(x) for x in alist]

CPU times: user 23.7 s, sys: 160 ms, total: 23.9 s
Wall time: 23.9 s

Single thread numpy is the way to go here, it seems!
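For reference, a minimal self-contained version of that winner; the sizes here are a toy stand-in for the real data:

import numpy as np

# toy stand-in: 1,000 sublists of 500 strings each
alist = [[str(j) for j in range(500)] for _ in range(1000)]

# np.random.shuffle works in place on plain Python lists, too
[np.random.shuffle(x) for x in alist]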

TerryH
  • Did you try the pandas sample method? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html – alan.elkin Feb 29 '20 at 02:09
  • One other thing: sometimes it is very useful (and much more efficient) to sample indexes rather than the items in the main list itself, so that you can then access the list through the shuffled indexes. – alan.elkin Feb 29 '20 at 02:28
  • I shall try the sampling and report back. Thanks! I hope the gain in speed offsets the cost of converting between lists and dataframes. – TerryH Feb 29 '20 at 03:57

1 Answer


@TerryH - you need not .shuffle() the RAM-memory content of an aListOfSTRINGs at all. It is enough to generate a np.random.permutation( len( aListOfListsOfSTRINGs[ ith ] ) ) ad hoc, at a cost of about ~ 260 [us] per list, spent as late as possible (ALAP), to get a new random order, right-sized for indirect access to the str members of the ith aListOfSTRINGs.
( Why move RAM-I/O-expensive data just to "read" it in order somewhere later, when no data need ever be touched until "read" ALAP from cache-served blocks via indirect addressing of the components? )
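A minimal sketch of that permutation-based indirect access, with toy data standing in for the real aListOfListsOfSTRINGs:

import numpy as np

aListOfListsOfSTRINGs = [ [ "a", "b", "c" ], [ "d", "e" ] ]     # toy data

ith = 0
# generate a fresh random order ALAP, without moving any str data
order = np.random.permutation( len( aListOfListsOfSTRINGs[ ith ] ) )

# "read" the ith aListOfSTRINGs in shuffled order via indirect addressing
for j in order:
    s = aListOfListsOfSTRINGs[ ith ][ j ]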

For the real-world costs of going parallel, you may like this post, with an interactive graph tool.


As @user2357112 supports Monica pointed out in a comment below, the shuffling was meant to take place inside each aListOfSTRINGs, not across aListOfListsOfSTRINGs. Mea culpa.

Q : "can I achieve the shuffling faster"?

Yes, a lot - about 150x faster: well under 2.5 [s] is achievable with the right tools.

Q : "... how I can make this shuffling operation more efficient ?"

The in-place .shuffle() takes less than ~ 23 [s] on a list( L ) of 16,000,000 items with plain Py2.7 tools:

from zmq import Stopwatch; aClk = Stopwatch() #_______________________ a [us] Stopwatch
pass;    import random

#_____________L creation ~ 2.7 [s]___________________________________________
aClk.start(); L = [ strID for strID in xrange( int( 16E6 ) ) ]; aClk.stop()
2721084

print L[:5] #___________________________________________________________proof
[0, 1, 2, 3, 4]

#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print "0:5\t", L[:5]
21473261
0:5     [13868243, 13087869, 13207292, 9344202, 1853783]

#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print "0:5\t", L[:5]
22573922
0:5     [837396, 15032889, 10942767, 14571341, 4867854]

#_______________________________________________________________________proof
>>> len( L )
16000000

The in-place .shuffle() takes under ~ 48 [s] on a list( L ) of 16,000,000 items with plain Py3.5 tools:

$ conda activate py3
$ python
...
aClk.start(); L = [ strID for strID in  range( int( 16E6 ) ) ]; aClk.stop()
1959052

#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print( "0:5\t", L[:5] )
45104806
0:5     [15744525, 10635923, 14530509, 10535840, 1465987]

#_____________random.shuffle( L )______________________________________+proof
aClk.start(); random.shuffle( L ); aClk.stop(); print( "0:5\t", L[:5] )
47139358
0:5     [884437, 15420153, 9957947, 8118734, 11960914]

Let's get The Real Performance boost:

import numpy as np

#____________L_as_a32______________16E6________________________~ 74 [ms]
aClk.start(); a32 = np.arange( 16E6, dtype = np.int32 ); aClk.stop()
74054

#_____________np.random.shuffle( a32-bit )______________________________+proof
aClk.start(); np.random.shuffle( a32 ); aClk.stop(); print "0:5\t", a32[:5]
2400786
0:5     [ 2487493 14646705 13717283  5602561  7934593]

aClk.start(); np.random.shuffle( a32 ); aClk.stop(); print "0:5\t", a32[:5]
2368381
0:5     [ 4841042 12882529 12298351  2198866  7054284]

aClk.start(); np.random.shuffle( a32 ); aClk.stop(); print "0:5\t", a32[:5]
2369011
0:5     [14595649  7239135  3339593  9517600  6506681]

#____________L_as_a64______________16E6_______________________________________
aClk.start(); a64 = np.arange( 16E6, dtype = np.int64 ); aClk.stop()

#_____________np.random.shuffle( a64-bit )______________________________+proof
aClk.start(); np.random.shuffle( a64 ); aClk.stop(); print "0:5\t", a64[:5]
2424487
0:5     [ 3234133  9224551   971604 13027484   806393]

aClk.start(); np.random.shuffle( a64 ); aClk.stop(); print "0:5\t", a64[:5]
2386873
0:5     [ 3212124 10644428  8192909  2234984 13103406]

aClk.start(); np.random.shuffle( a64 ); aClk.stop(); print "0:5\t", a64[:5]
2376065
0:5     [ 5624301  7741070  8859092 12287465 11721315]

If indeed going for The Ultimate Performance:

  • keep all str data as-is, stored only inside each aListOfSTRINGs
  • append each new aListOfSTRINGs at a constant O(1) cost to a never-re-shuffled, linearly growing, constant-order store - aListOfListsOfSTRINGs
  • instead of paying the awfully high memory-I/O costs of shuffling that store, maintain an aListOfORDINALs (a plain list or a numpy array), simply appending len( aListOfListsOfSTRINGs ) whenever a new member aListOfSTRINGs comes in
  • enjoy the very fast and very efficient in-place shuffle of aListOfORDINALs, well under 23 [s] in Py2.7 or under 50 [s] in Py3.5
  • access any str as
    aListOfListsOfSTRINGs[aListOfORDINALs[Nth_L_inLoLoStr]][Nth_str_inLoStr], at a superfast O(1) cost per access - a minimal sketch follows after this list
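A minimal sketch of this ordinal-indirection scheme, again with toy data; the names follow the bullets above:

import numpy as np

aListOfListsOfSTRINGs = [ [ "a", "b" ], [ "c", "d", "e" ], [ "f" ] ]   # toy data

# one ordinal per member-aListOfSTRINGs, appended as each one arrives
aListOfORDINALs = np.arange( len( aListOfListsOfSTRINGs ) )

np.random.shuffle( aListOfORDINALs )    # cheap in-place shuffle of indices only

# indirect O(1) access - the str data itself is never moved
for Nth_L_inLoLoStr in aListOfORDINALs:
    first_str = aListOfListsOfSTRINGs[ Nth_L_inLoLoStr ][ 0 ]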
user3666197
  • You seem to have gotten the task mixed up - the question is about performing ~16 million shuffles, each shuffle shuffling one sublist. It's not about shuffling the ~16 million element outer list. – user2357112 Feb 29 '20 at 08:32
  • As the above comment suggested, the task is to shuffle 16m small lists rather than shuffle the list of 16m items. However, if shuffling numpy is faster, then I can try to shuffle an array of indices and then select the shuffled list via the indices. – TerryH Feb 29 '20 at 11:16
  • @TerryH - you need not `.shuffle()` the RAM-memory-content of `aListOfSTRINGs` at all, it ought be just enough to generate a `np.random.permutation( len( aListOfListsOfSTRINGs[ ith ] ) )` so as to create ad-hoc, at a cost of but **`O(1)`** ~ **`260 [us]`** per list, spent ALAP, a new random order, right-sized for an indirect access to the `str`-members of the ith-`aListOfSTRINGs` *( why moving RAM-I/O expensive data so as to "read"-in-order somewhere later, when no data need ever be touched, until ALAP "reading" from cache-served block(s) using an indirect-addressing of the components? )* – user3666197 Feb 29 '20 at 11:33