I have a house price prediction dataset, and I have to split it into train and test sets. Is it possible to do this using numpy or scipy? I cannot use scikit-learn at the moment.

6 Answers
I know that your question was only about doing a train_test_split with numpy or scipy, but there is actually a very simple way to do it with pandas:
import pandas as pd
# Shuffle your dataset
shuffle_df = df.sample(frac=1)
# Define a size for your train set
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]
For those who would like a fast and easy solution.
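As a small extension (a sketch; the label column name price is an assumption about your dataset, not part of the original answer), you could then separate features and labels from each split:
# "price" is a hypothetical label column name
X_train = train_set.drop("price", axis=1)
y_train = train_set["price"]
X_test = test_set.drop("price", axis=1)
y_test = test_set["price"]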

Although this is an old question, this answer might help. It follows how sklearn implements train_test_split; the method below takes similar arguments to sklearn's.
import numpy as np
from itertools import chain

def _indexing(x, indices):
    """
    :param x: array from which indices have to be fetched
    :param indices: indices to be fetched
    :return: sub-array from given array and indices
    """
    # numpy array indexing
    if hasattr(x, 'shape'):
        return x[indices]
    # list indexing
    return [x[idx] for idx in indices]

def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split into train and test
    :param test_size: size of the test set, in range (0, 1)
    :param shuffle: whether to shuffle the arrays or not
    :param random_seed: random seed value
    :return: list of length 2*len(arrays), with a train/test pair for each input array
    """
    # sanity checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length * test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))
Of course, sklearn's implementation supports stratified k-fold splits, splitting of pandas Series, etc. This one only works for splitting lists and numpy arrays, which I think will work for your case.
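A quick usage sketch of the function above (the X and y arrays here are dummy placeholders, not from the original answer):
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (dummy data)
y = np.arange(10)                 # 10 dummy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(len(X_train), len(X_test))  # 7 3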

This code should work (assuming X_data is a pandas DataFrame):
import numpy as np

num_of_rows = int(len(X_data) * 0.8)  # 80% of rows for training; must be an int for slicing
values = X_data.values
np.random.shuffle(values)  # shuffles the rows in place to make the split random
train_data = values[:num_of_rows]  # first 80% of rows for training
test_data = values[num_of_rows:]   # remaining rows for testing
Hope this helps!
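If you want the shuffle to be reproducible, one option (a sketch, not part of the original answer) is to seed numpy before shuffling:
import numpy as np

np.random.seed(42)  # any fixed integer makes the shuffle repeatable
np.random.shuffle(values)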

- Thanks. One more question: in the top row I have the column labels. I think I need to remove them, right? – CODE_DIY Nov 09 '17 at 13:53
- @CODE_DIY Yes, you should remove the column labels. I recommend you save the column labels and say: df.columns = [(insert column labels here)]. – jaguar Nov 09 '17 at 14:00
- The sorting at the end is unnecessary. Just keep it shuffled. I would also use the permutation method from numpy's random module instead and index into your dataframe. https://stackoverflow.com/a/29576803/3250829 – rayryeng Jan 23 '22 at 02:48
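A sketch of what that last comment suggests (assuming df is a pandas DataFrame; the 0.8 ratio is illustrative):
import numpy as np

perm = np.random.permutation(len(df))  # random ordering of row positions
cutoff = int(len(df) * 0.8)
train_df = df.iloc[perm[:cutoff]]      # first 80% of the permuted rows
test_df = df.iloc[perm[cutoff:]]       # remaining 20%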
This solution uses only pandas and numpy:
import numpy as np

def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[valid_set_size:test_set_size + valid_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]

train_set, valid_set, test_set = split_train_valid_test(dataset, valid_ratio=0.2, test_ratio=0.2)
print(len(train_set), len(valid_set), len(test_set))
## out: 16512 4128 4128

- I think you need to replace train_indices = shuffled_indices[test_set_size:] with train_indices = shuffled_indices[test_set_size + valid_set_size:]. That way you avoid moving elements to the training set that are already in the validation or test sets. – Jorge May 22 '22 at 19:52
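A corrected sketch incorporating that fix, keeping everything else the same (data is whatever DataFrame you pass in):
import numpy as np

def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    test_set_size = int(len(data) * test_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_indices = shuffled_indices[valid_set_size:valid_set_size + test_set_size]
    # skip both the validation and test slices so the three sets are disjoint
    train_indices = shuffled_indices[valid_set_size + test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]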
import numpy as np
import pandas as pd

X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"],
            axis=1, inplace=True)  # important to drop prices as well

# create a random train/test split
indices = np.arange(X_data.shape[0])  # np.arange so the indices can be shuffled in place
num_training_instances = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_instances]
test_indices = indices[num_training_instances:]

# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]
This assumes you want a random split. What happens is that we create an array of indices as long as the number of data points you have, i.e. the length of the first axis of X_data (or Y_data). We then put them in random order and take the first 80% of those random indices as training data and the rest for testing. [:num_training_instances] just selects the first num_training_instances entries from the array. After that you just extract the rows from your data using the lists of random indices, and your data is split. Remember to drop the prices from your X_data, and set a seed if you want the split to be reproducible (np.random.seed(some_integer) at the beginning).
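As a quick sanity check (a small sketch following the variables in this answer), you can verify that the split covers all rows exactly once:
print(X_data_train.shape, X_data_test.shape)
# the two row counts should sum to the original number of rows
assert len(X_data_train) + len(X_data_test) == len(X_data)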

- I would like to split it into 80% training to 20% test. What would the code be then? – CODE_DIY Nov 09 '17 at 13:00
- If you want it split 80% to 20%, make the value of your num_train_examples variable equal to 80% of the number of rows in your dataset. If you had 100 rows, you would set it to 80. – jaguar Nov 09 '17 at 13:13
- @jaguar can you explain `all_data[:num_train_examples]`? Are we slicing it? Is there any other source that I can read? – CODE_DIY Nov 09 '17 at 13:20
- @CODE_DIY I believe that all_data is your dataset and you are slicing it; however, you cannot slice pandas DataFrames with a simple slice like that. I will post an answer that hopefully helps more. – jaguar Nov 09 '17 at 13:23
- This is my code so far: `import pandas as pd X_data = pd.read_csv('house.csv') X_data.drop(['offers','brick','bathrooms'], axis=1, inplace=True) y_data = X_data['price']`. Now I need to split it. – CODE_DIY Nov 09 '17 at 13:35
- @CODE_DIY check my code in the first answer; I believe this will answer your question. – jaguar Nov 09 '17 at 13:45
Here is a quick way to perform an 80/20 split with just the random import:
import random

# Define a sample size, here 80% of the observations
sample_size = int(len(x) * 0.80)

# Set seed for reproducibility
random.seed(47202182)

# Indices are randomly sampled without replacement from 0 to the length of the original sample
train_idx = random.sample(range(len(x)), sample_size)

# Indices not in the train set must be in the test set
train_idx_set = set(train_idx)  # set lookup keeps this step from being quadratic
test_idx = [i for i in range(len(x)) if i not in train_idx_set]

# Apply indices to the lists to assign data to the corresponding variables
x_train = [x[i] for i in train_idx]
x_test = [x[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]
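A quick verification sketch (assuming x and y each have 100 elements):
print(len(x_train), len(x_test))  # 80 20
# sampling without replacement guarantees the two index sets never overlap
assert set(train_idx).isdisjoint(test_idx)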
