I have a house price prediction dataset, and I have to split it into train and test sets. Is it possible to do this using numpy or scipy? I cannot use scikit-learn at the moment.

6 Answers
I know that your question was only about doing a train_test_split with numpy or scipy, but there is actually a very simple way to do it with pandas:
import pandas as pd
# Shuffle your dataset
shuffle_df = df.sample(frac=1)
# Define a size for your train set
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]
For those who would like a fast and easy solution.
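As a small extension (a sketch; the label column name price is an assumption about your dataset, not part of the original answer), you could then separate features and labels from each split:
# "price" is a hypothetical label column name
X_train = train_set.drop("price", axis=1)
y_train = train_set["price"]
X_test = test_set.drop("price", axis=1)
y_test = test_set["price"]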

Although this is an old question, this answer might help. It follows how sklearn implements train_test_split; the method below takes similar arguments to sklearn's.
import numpy as np
from itertools import chain

def _indexing(x, indices):
    """
    :param x: array from which indices have to be fetched
    :param indices: indices to be fetched
    :return: sub-array from given array and indices
    """
    # numpy array indexing
    if hasattr(x, 'shape'):
        return x[indices]
    # list indexing
    return [x[idx] for idx in indices]

def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split into train and test
    :param test_size: size of the test set, in range (0, 1)
    :param shuffle: whether to shuffle the arrays or not
    :param random_seed: random seed value
    :return: list of length 2*len(arrays), with a train/test pair for each input array
    """
    # sanity checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length * test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))
Of course, sklearn's implementation supports stratified k-fold splits, splitting of pandas Series, etc. This one only works for splitting lists and numpy arrays, which I think will work for your case.
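A quick usage sketch of the function above (the X and y arrays here are dummy placeholders, not from the original answer):
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (dummy data)
y = np.arange(10)                 # 10 dummy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(len(X_train), len(X_test))  # 7 3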

This code should work (assuming X_data is a pandas DataFrame):
import numpy as np

num_of_rows = int(len(X_data) * 0.8)  # 80% of rows for training; must be an int for slicing
values = X_data.values
np.random.shuffle(values)  # shuffles the rows in place to make the split random
train_data = values[:num_of_rows]  # first 80% of rows for training
test_data = values[num_of_rows:]   # remaining rows for testing
Hope this helps!
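If you want the shuffle to be reproducible, one option (a sketch, not part of the original answer) is to seed numpy before shuffling:
import numpy as np

np.random.seed(42)  # any fixed integer makes the shuffle repeatable
np.random.shuffle(values)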

- Thanks. One more question: in the top row I have the column labels. I think I need to remove them, right? – CODE_DIY Nov 09 '17 at 13:53
- @CODE_DIY Yes, you should remove the column labels. I recommend you save the column labels and say: df.columns = [(insert column labels here)]. – jaguar Nov 09 '17 at 14:00
- The sorting at the end is unnecessary. Just keep it shuffled. I would also use the permutation method from numpy's random module instead and index into your dataframe. https://stackoverflow.com/a/29576803/3250829 – rayryeng Jan 23 '22 at 02:48
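A sketch of what that last comment suggests (assuming df is a pandas DataFrame; the 0.8 ratio is illustrative):
import numpy as np

perm = np.random.permutation(len(df))  # random ordering of row positions
cutoff = int(len(df) * 0.8)
train_df = df.iloc[perm[:cutoff]]      # first 80% of the permuted rows
test_df = df.iloc[perm[cutoff:]]       # remaining 20%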
This solution uses only pandas and numpy:
import numpy as np

def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[valid_set_size:test_set_size + valid_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]

train_set, valid_set, test_set = split_train_valid_test(dataset, valid_ratio=0.2, test_ratio=0.2)
print(len(train_set), len(valid_set), len(test_set))
## out: 16512 4128 4128

- I think you need to replace train_indices = shuffled_indices[test_set_size:] with train_indices = shuffled_indices[test_set_size + valid_set_size:]. That way you avoid moving elements to the training set that are already in the validation or test sets. – Jorge May 22 '22 at 19:52
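A corrected sketch incorporating that fix, keeping everything else the same (data is whatever DataFrame you pass in):
import numpy as np

def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    test_set_size = int(len(data) * test_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_indices = shuffled_indices[valid_set_size:valid_set_size + test_set_size]
    # skip both the validation and test slices so the three sets are disjoint
    train_indices = shuffled_indices[valid_set_size + test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]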
import numpy as np
import pandas as pd

X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"],
            axis=1, inplace=True)  # important to drop prices as well

# create a random train/test split
indices = np.arange(X_data.shape[0])  # np.arange so the indices can be shuffled in place
num_training_instances = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_instances]
test_indices = indices[num_training_instances:]

# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]
This assumes you want a random split. What happens is that we create an array of indices as long as the number of data points you have, i.e. the length of the first axis of X_data (or Y_data). We then put them in random order and take the first 80% of those random indices as training data and the rest for testing. [:num_training_instances] just selects the first num_training_instances entries from the array. After that you just extract the rows from your data using the lists of random indices, and your data is split. Remember to drop the prices from your X_data, and set a seed if you want the split to be reproducible (np.random.seed(some_integer) at the beginning).
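As a quick sanity check (a small sketch following the variables in this answer), you can verify that the split covers all rows exactly once:
print(X_data_train.shape, X_data_test.shape)
# the two row counts should sum to the original number of rows
assert len(X_data_train) + len(X_data_test) == len(X_data)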

- I would like to split it into 80% training to 20% test. What would the code be then? – CODE_DIY Nov 09 '17 at 13:00
- If you want it split 80% to 20%, make the value of your num_train_examples variable equal to 80% of the number of rows in your dataset. If you had 100 rows, you would set it to 80. – jaguar Nov 09 '17 at 13:13
- @jaguar can you explain `all_data[:num_train_examples]`? Are we slicing it? Is there any other source that I can read? – CODE_DIY Nov 09 '17 at 13:20
- @CODE_DIY I believe that all_data is your dataset and you are slicing it; however, you cannot slice pandas DataFrames with a simple slice like that. I will post an answer that hopefully helps more. – jaguar Nov 09 '17 at 13:23
- This is my code so far: `import pandas as pd X_data = pd.read_csv('house.csv') X_data.drop(['offers','brick','bathrooms'], axis=1, inplace=True) y_data = X_data['price']`. Now I need to split it. – CODE_DIY Nov 09 '17 at 13:35
- @CODE_DIY check my code in the first answer; I believe this will answer your question. – jaguar Nov 09 '17 at 13:45
Here is a quick way to perform an 80/20 split with just the random import:
import random

# Define a sample size, here 80% of the observations
sample_size = int(len(x) * 0.80)

# Set seed for reproducibility
random.seed(47202182)

# Indices are randomly sampled without replacement from 0 to the length of the original sample
train_idx = random.sample(range(len(x)), sample_size)

# Indices not in the train set must be in the test set
train_idx_set = set(train_idx)  # set lookup keeps this step from being quadratic
test_idx = [i for i in range(len(x)) if i not in train_idx_set]

# Apply indices to the lists to assign data to the corresponding variables
x_train = [x[i] for i in train_idx]
x_test = [x[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]
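A quick verification sketch (assuming x and y each have 100 elements):
print(len(x_train), len(x_test))  # 80 20
# sampling without replacement guarantees the two index sets never overlap
assert set(train_idx).isdisjoint(test_idx)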
