X.shape  # output is => (2555904, 1024, 2)
X[0]     # output is =>
# array([[ 0.0420274 ,  0.23476323],
#        [-0.2728826 ,  0.40513492],
#        [-0.26707262,  0.22749889],
#        ...,
#        [-0.7055947 , -0.28693035],
#        [-0.41157472,  0.66826206],
#        [ 0.06487698,  0.6358149 ]], dtype=float32)
import numpy as np

total = len(X)
n_train = int(0.8 * total)  # 80% of samples for training
n_test = total - n_train    # remaining 20% for testing

# Randomly select 80% of the indices, without replacement
train_idx = np.random.choice(total, size=n_train, replace=False)
# The test set is every index not selected for training
test_idx = np.array(sorted(set(range(total)) - set(train_idx)))
train_idx.sort()

X_train = X[train_idx]  # these two lines are the slow part
X_test = X[test_idx]

I am stuck at the last two lines of this code, i.e. the X_train and X_test part: it takes a very long time to run. Is there a faster way to do this? All I want is to split the X data into an 80/20 ratio. Any suggestions are welcome.

The dataset that I am using is RadioML2018.01A.

The dataset is available at: https://www.kaggle.com/pinxau1000/radioml2018-01a-get-started/data

I think the main problem is the size of the data (about 20 GB). How can I overcome that and split the data?

  • There's no "instant way" to manipulate 20 GB of memory like that. Your best option is to load only the parts you need (i.e. a batch of data); see the sketch below. – Alexey S. Larionov Feb 15 '22 at 13:21
  • Does this answer your question? [How to split/partition a dataset into training and test datasets for, e.g., cross validation?](https://stackoverflow.com/questions/3674409/how-to-split-partition-a-dataset-into-training-and-test-datasets-for-e-g-cros) – jjramsey Feb 15 '22 at 13:23
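A minimal sketch of the batch-loading idea from the first comment, assuming the data comes from the HDF5 file on the linked Kaggle page; the file name and the 'X' key are assumptions, so adjust them to your copy:

import numpy as np
import h5py

with h5py.File("GOLD_XYZ_OSC.0001_1024.hdf5", "r") as f:  # assumed file name
    X_ds = f["X"]             # h5py Dataset handle: nothing is read yet
    total = X_ds.shape[0]

    rng = np.random.default_rng(42)
    train_idx = rng.choice(total, size=int(0.8 * total), replace=False)
    train_idx.sort()          # h5py requires indices in increasing order

    batch_size = 4096
    for start in range(0, len(train_idx), batch_size):
        batch_idx = train_idx[start:start + batch_size]
        X_batch = X_ds[batch_idx]  # only this batch is read from disk
        # ... feed X_batch to your model here ...

This keeps only one batch in RAM at a time instead of materialising a ~16 GB X_train array.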

1 Answer


You can use sklearn.model_selection.train_test_split; check out the official documentation, which includes an example. You have to split the data into the explanatory variables and the target variable first.
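A minimal sketch of that suggestion, assuming you have already extracted a label array y from the dataset (the random_state value is illustrative). Note that, like fancy indexing, train_test_split copies the selected rows, so on its own it does not avoid the memory cost of a ~20 GB X:

from sklearn.model_selection import train_test_split

# 80/20 split, matching the ratio in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

If you only want to split X itself, train_test_split also accepts a single array: X_train, X_test = train_test_split(X, test_size=0.2).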