One answer would be to read each file one by one and spread its lines across N new files. That way, you obtain "shuffled files" with a similar number of lines and with the same proportion of lines coming from each original file. Of course, it depends a lot on what kind of shuffled files you actually need.
The reading of the original files could be done in parallel, but then you would need to coordinate the processes so they never write to the same file at the same time. I won't describe that fully here, because I think it is more than what is needed; see for example: Python multiprocessing safely writing to a file.
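If you do want to parallelize it, here is a minimal sketch of the idea only, assuming one shared lock per output file and a hypothetical process_file worker (names and block logic are placeholders, not the solution used below):

import multiprocessing as mp
import numpy as np

def process_file(path, locks, out_paths):
    # hypothetical worker: shuffles one original file and appends one block
    # to each output file, holding that file's lock while writing
    with open(path) as f:
        lines = f.readlines()
    np.random.shuffle(lines)
    block = len(lines) // len(out_paths)   # remainder lines dropped for brevity
    for k, (lock, out_path) in enumerate(zip(locks, out_paths)):
        with lock:                         # only one process writes to this file at a time
            with open(out_path, 'a') as out:
                out.writelines(lines[k * block:(k + 1) * block])

if __name__ == '__main__':
    originals = [f'bigfile-{i}' for i in range(50)]
    out_paths = [f'bigfile_shuffled-{i}' for i in range(len(originals))]
    with mp.Manager() as manager:
        locks = [manager.Lock() for _ in out_paths]   # Manager locks can be passed to Pool workers
        with mp.Pool() as pool:
            pool.starmap(process_file, [(p, locks, out_paths) for p in originals])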
Besides the number of files you have and/or want, the limiting part below is the shuffling. Since your question is limited to files of 50k lines for machine learning, I think the procedure below is enough: an array of 50k * 10 float64 values takes around 4 MB, so it can be loaded entirely into memory and shuffled with np.random.shuffle. If your files were much bigger, you would need another method, see shuffle a large list of items without loading in memory.
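As a quick sanity check of that estimate (assuming float64 values, i.e. 8 bytes per number):

import numpy as np

arr = np.zeros((50_000, 10))   # float64 by default
print(arr.nbytes / 1e6)        # ~4.0 MB, small enough to shuffle in memory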
Thus, the procedure could be:
1. For original file 1:
   1. Read the file
   2. Shuffle its lines
   3. Divide the lines into N blocks (assuming the number of lines is higher than N)
   4. Write the blocks into the shuffled files
2. Go to the next file and restart at 1.1.
First things first, I generated 50 files of 100,000 lines, around 25 MB each:
import numpy as np

# generate 50 sample files of 100,000 rows x 10 columns (~25 MB each)
for i in range(50):
    arr = np.random.randint(1000, size=(100000, 10))
    with open(f'bigfile-{i}', 'w') as f:
        np.savetxt(f, arr, delimiter=',')
The code below is rough, but it works:
originalFiles = [f'bigfile-{i}' for i in range(50)]  # paths of your original files
nbShuffled = len(originalFiles)  # number of shuffled files (you can choose another value)

for file in originalFiles:
    # 1. Read the original file
    with open(file, 'r') as f:
        lines = f.readlines()
    # 2. Shuffle its lines
    np.random.shuffle(lines)
    # 3. Estimate the number of lines per block
    nbLines = len(lines)
    firstBlocks = nbLines // nbShuffled
    lastBlock = firstBlocks + nbLines % nbShuffled
    blocks = [firstBlocks] * (nbShuffled - 1) + [lastBlock]
    # 4. Write one block to each shuffled file
    np.random.shuffle(blocks)  # avoid the bigger last block always ending up in the same shuffled file
    x = 0
    for b in range(nbShuffled):
        with open(f'bigfile_shuffled-{b}', 'a') as f:
            f.writelines(lines[x : x + blocks[b]])
        x += blocks[b]
It took ~13 s to run on my computer (64-bit Linux, 32 GB RAM, 16 CPUs).
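To double-check the result, you can verify that the total number of lines is preserved (a small sanity check, reusing the file names from above):

# the shuffled files together should contain exactly as many lines as the originals
total_in = sum(sum(1 for _ in open(f'bigfile-{i}')) for i in range(50))
total_out = sum(sum(1 for _ in open(f'bigfile_shuffled-{i}')) for i in range(50))
print(total_in, total_out)   # both should be 50 * 100,000 = 5,000,000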