
I am working with a very large dataset (~50 GB) and I want to sample it to reduce its size. The sampling fraction should be configurable, for example 20% of the original dataset.

I usually use the train_test_split function from the sklearn.model_selection package. This function lets me obtain a stratified sample of the dataset based on the class label (which is binary in this specific case).
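For reference, a minimal sketch of the kind of stratified call described above. The toy arrays, the 60/40 label ratio, and the variable names are assumptions made for illustration, not my real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 600 + [0] * 400)

# Fraction of rows to keep; this is the "dynamic" sampling parameter.
sample_fraction = 0.2

# stratify=y asks for the label ratio to be preserved in both parts.
# The "test" part is used here as the reduced sample.
X_rest, X_sample, y_rest, y_sample = train_test_split(
    X, y, test_size=sample_fraction, stratify=y, random_state=0
)
```

The reduced sample keeps roughly the same proportion of positive labels as the full array, but nothing here looks at the identifier.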

For this dataset I need one additional constraint, based on the identifier: if an identifier, which can occur multiple times, appears in the training set one or more times, it must not appear in the test set.

To summarize, what I need is:

  1. reduce the dataset size (while preserving the class label frequencies)
  2. get a train/test split that is:

    2.1 stratified with respect to the class label

    2.2 subject to the identifier constraint defined above

An example of the dataset structure is:

ID    F1    LABEL
1     2     1
2     2     1
1     2     1
4     2     1
6     2     1
1     2     1
4     2     0
6     2     0
4     2     0
6     2     0

Reducing the size by 50%:

ID    F1    LABEL
1     2     1
2     2     1
1     2     1
4     2     0
6     2     0
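The reduction step on the example table could be sketched with pandas as a per-label sample. This assumes pandas 1.1 or later (for groupby(...).sample) and that the data fits in memory, which may not hold for the full 50 GB; which rows are drawn depends on the random seed, so the rows can differ from the table above:

```python
import pandas as pd

# The example table from above.
df = pd.DataFrame({
    "ID":    [1, 2, 1, 4, 6, 1, 4, 6, 4, 6],
    "F1":    [2] * 10,
    "LABEL": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Fraction of rows to keep within each class.
sample_fraction = 0.5

# Sampling inside each LABEL group keeps the class frequencies:
# 6 positives -> 3 rows, 4 negatives -> 2 rows.
reduced = df.groupby("LABEL").sample(frac=sample_fraction, random_state=0)

label_counts = reduced["LABEL"].value_counts().to_dict()
```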

A possible Train & Test split:

      Train
ID    F1    LABEL
1     2     1
1     2     1
4     2     0

      Test
ID    F1    LABEL
2     2     1
6     2     0

Notice that the ID split is not ordered, so I cannot use a simple mask with pandas as is suggested in many answers to the split problem (Pandas split DataFrame by column value). At the moment my pipeline has the following steps:

Reducing dataset dimension:

 X_tr, X_te, l_tr, l_te = sklearn.model_selection.train_test_split(X, l, test_size=0.5, stratify=l)
 (join on the test part, dropping the train part)

Stratified sample with selection on ID:

(same as before, joining on both the train and test sets to reconstruct the two datasets)
The ID selection is performed by collecting the set of all IDs and randomly choosing which IDs go to training and which go to testing.
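The random ID selection described above can be written without joins, as a rough sketch. The test_fraction value and the variable names are assumptions for illustration, and this version does not stratify on the label; it only enforces the identifier constraint:

```python
import numpy as np
import pandas as pd

# The example table from above.
df = pd.DataFrame({
    "ID":    [1, 2, 1, 4, 6, 1, 4, 6, 4, 6],
    "F1":    [2] * 10,
    "LABEL": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Hypothetical share of distinct IDs assigned to the test set.
test_fraction = 0.25
rng = np.random.default_rng(0)

# Shuffle the distinct IDs and take the first chunk as test IDs.
shuffled_ids = rng.permutation(df["ID"].unique())
n_test_ids = max(1, int(round(test_fraction * len(shuffled_ids))))
chosen_test_ids = shuffled_ids[:n_test_ids]

# Membership test per row; it does not require the IDs to be ordered.
is_test = df["ID"].isin(chosen_test_ids)
test = df[is_test]
train = df[~is_test]
```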

As you may notice, this solution is really slow, and I am wondering how I can reduce the computation time (currently more than 3 hours).

Guido Muscioni

0 Answers