One answer would be to read each file one by one and spread its lines across N new files. That way, you obtain "shuffled files" with a similar number of lines and with the same proportion of lines coming from each original file. Of course, it depends a lot on what kind of shuffled files you actually need.
The reading of the original files could be done in parallel, but then you would need to coordinate the processes so they never write to the same file at the same time. I won't describe that fully here, because I think it is more than what is needed; see for example: Python multiprocessing safely writing to a file.
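If you do want to parallelize it, here is a minimal sketch of the idea only, assuming one shared lock per output file and a hypothetical process_file worker (names and block logic are placeholders, not the solution used below):

import multiprocessing as mp
import numpy as np

def process_file(path, locks, out_paths):
    # hypothetical worker: shuffles one original file and appends one block
    # to each output file, holding that file's lock while writing
    with open(path) as f:
        lines = f.readlines()
    np.random.shuffle(lines)
    block = len(lines) // len(out_paths)   # remainder lines dropped for brevity
    for k, (lock, out_path) in enumerate(zip(locks, out_paths)):
        with lock:                         # only one process writes to this file at a time
            with open(out_path, 'a') as out:
                out.writelines(lines[k * block:(k + 1) * block])

if __name__ == '__main__':
    originals = [f'bigfile-{i}' for i in range(50)]
    out_paths = [f'bigfile_shuffled-{i}' for i in range(len(originals))]
    with mp.Manager() as manager:
        locks = [manager.Lock() for _ in out_paths]   # Manager locks can be passed to Pool workers
        with mp.Pool() as pool:
            pool.starmap(process_file, [(p, locks, out_paths) for p in originals])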
Besides the number of files you have and/or want, the limiting part below is the shuffling. Since your question is limited to files of 50k lines for machine learning, I think the procedure below is enough: an array of 50k * 10 float64 values takes around 4 MB, so it can be loaded entirely into memory and shuffled with np.random.shuffle. If your files were much bigger, you would need another method, see shuffle a large list of items without loading in memory.
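As a quick sanity check of that estimate (assuming float64 values, i.e. 8 bytes per number):

import numpy as np

arr = np.zeros((50_000, 10))   # float64 by default
print(arr.nbytes / 1e6)        # ~4.0 MB, small enough to shuffle in memory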
Thus, the procedure could be:
1. For original file 1:
   1. Read the file
   2. Shuffle its lines
   3. Divide the lines into N blocks (assuming the number of lines is higher than N)
   4. Write the blocks into the shuffled files
2. Go to the next file and restart at 1.1.
First things first, I generated 50 files of 100,000 lines, around 25 MB each:
import numpy as np

# generate 50 sample files of 100,000 rows x 10 columns (~25 MB each)
for i in range(50):
    arr = np.random.randint(1000, size=(100000, 10))
    with open(f'bigfile-{i}', 'w') as f:
        np.savetxt(f, arr, delimiter=',')
The code below is rough, but it works:
originalFiles = [f'bigfile-{i}' for i in range(50)]  # paths of your original files
nbShuffled = len(originalFiles)  # number of shuffled files (you can choose another value)

for file in originalFiles:
    # 1. Read the original file
    with open(file, 'r') as f:
        lines = f.readlines()
    # 2. Shuffle its lines
    np.random.shuffle(lines)
    # 3. Estimate the number of lines per block
    nbLines = len(lines)
    firstBlocks = nbLines // nbShuffled
    lastBlock = firstBlocks + nbLines % nbShuffled
    blocks = [firstBlocks] * (nbShuffled - 1) + [lastBlock]
    # 4. Write one block to each shuffled file
    np.random.shuffle(blocks)  # avoid the bigger last block always ending up in the same shuffled file
    x = 0
    for b in range(nbShuffled):
        with open(f'bigfile_shuffled-{b}', 'a') as f:
            f.writelines(lines[x : x + blocks[b]])
        x += blocks[b]
It took ~13 s to run on my computer (64-bit Linux, 32 GB RAM, 16 CPUs).
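To double-check the result, you can verify that the total number of lines is preserved (a small sanity check, reusing the file names from above):

# the shuffled files together should contain exactly as many lines as the originals
total_in = sum(sum(1 for _ in open(f'bigfile-{i}')) for i in range(50))
total_out = sum(sum(1 for _ in open(f'bigfile_shuffled-{i}')) for i in range(50))
print(total_in, total_out)   # both should be 50 * 100,000 = 5,000,000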