
I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and 50 lines as the testing set.

My idea is to first generate a random permutation of length 100 (values ranging from 1 to 100), then use the first 50 elements as the line numbers of the 50 training examples and the remaining 50 as the line numbers of the testing examples.

This can be done easily in MATLAB:

fid = fopen(datafile);
C = textscan(fid, '%s', 'delimiter', '\n');
fclose(fid);
lines = C{1};           % textscan returns the lines as a cell array in C{1}
plist = randperm(100);  % random permutation of 1..100
for i = 1:50
    fprintf(train_file, '%s\n', lines{plist(i)});  % train_file: an already-opened output file
end
for i = 51:100
    fprintf(test_file, '%s\n', lines{plist(i)});   % test_file: an already-opened output file
end

But how can I accomplish this in Python? I'm new to Python and don't know whether I can read the whole file into a list and then choose certain lines.

– Freya Ren

8 Answers


This can be done similarly in Python using lists (note that the whole list is shuffled in place):

import random

with open("datafile.txt", "rb") as f:
    data = f.read().split('\n')

random.shuffle(data)

train_data = data[:50]
test_data = data[50:]
– ijmarshall
  • nice solution. But what if I don't know the amount of data in my file, i.e. it may contain some millions of observations, and I need to split the data into 85% and 15% sets? – desmond.carros Mar 15 '16 at 06:03
  • @desmond.carros take a look at `from sklearn.cross_validation import train_test_split`. So do it this way: `X_fit, X_eval, y_fit, y_eval = train_test_split(train, target, test_size=0.15, random_state=1)` – Rocketq Mar 23 '16 at 07:23
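For the case raised in the first comment, where the file is too big to count its lines up front, a minimal sketch of a single-pass streaming split (the file names `datafile.txt`, `train.txt`, and `test.txt` are placeholders, and the resulting split is only approximately 85/15):

import random

random.seed(1)  # fix the seed for a reproducible split

with open("datafile.txt") as src, \
     open("train.txt", "w") as train_f, \
     open("test.txt", "w") as test_f:
    for line in src:
        # route each line to the training file with probability 0.85
        (train_f if random.random() < 0.85 else test_f).write(line)

Because each line is written as soon as it is read, nothing beyond the current line is ever held in memory.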
from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()  # one example (line) per list element
    data = numpy.array(data)      # convert the list to a numpy array

x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5: half of the data goes to the test set
– shubhranshu
  • Hi, train_test_split accepts a python list too. You don't need to transform a python list into a numpy array. – Yulin GUO Aug 16 '18 at 09:57
  • For whoever is wondering, the first item in the tuple that `train_test_split` returns is the remaining percentage. E.g. `x_train, x_test = train_test_split(list(range(100)), test_size=0.2)` will return 80 items and 20 items respectively. – Eduardo Pignatelli Oct 01 '19 at 10:44

To answer @desmond.carros's question, I modified the best answer as follows:

import random

file = open("datafile.txt", "r")
data = list()
for line in file:
    data.append(line.split())  # split each line on your preferred delimiter
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # first 80% of the shuffled lines go to the training set
test_data = data[int((len(data)+1)*.80):]   # the remaining 20% go to the test set

The code splits the entire dataset into 80% train and 20% test data.

– subin sahayam

You could also use numpy, if your data is already stored in a numpy.ndarray:

import numpy as np
from random import sample

l = 100  # length of the data
f = 50   # number of elements you need
indices = sample(range(l), f)  # pick f distinct random indices for the training set

train_data = data[indices]            # data is your numpy.ndarray
test_data = np.delete(data, indices)  # everything that was not picked for training
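For example, with the 100-line file from the question, `data` could be built and the split checked like this (using the imports above; the file name is the one assumed throughout the question):

data = np.array(open("datafile.txt").read().splitlines())  # one string per line

indices = sample(range(len(data)), len(data) // 2)  # half of the indices, chosen at random
train_data = data[indices]
test_data = np.delete(data, indices)
print(len(train_data), len(test_data))  # prints "50 50" for a 100-line file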
– JLT

You can try this approach:

import pandas
import sklearn.cross_validation

csv = pandas.read_csv('data.csv')
train, test = sklearn.cross_validation.train_test_split(csv, train_size=0.5)

UPDATE: train_test_split was moved to model_selection, so the current way (scikit-learn 0.22.2) to do it is this:

import pandas
import sklearn.model_selection

csv = pandas.read_csv('data.csv')
train, test = sklearn.model_selection.train_test_split(csv, train_size=0.5)
– Roman Gherta

sklearn.cross_validation has been deprecated since version 0.18; instead you should use sklearn.model_selection, as shown below:

from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()  # one example (line) per list element
    data = numpy.array(data)      # convert the list to a numpy array

x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5: half of the data goes to the test set
– Andrew

The following produces more general k-fold cross-validation splits. Your 50-50 partitioning would be achieved by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.

import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data
    with open(myfile) as f:
        data = f.readlines()

    # Shuffle input
    random.seed(myseed)
    random.shuffle(data)

    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))

    # Create one partition per fold
    train = {}
    test = {}
    for ii in range(k):
        test[ii] = data[ii * len_part:(ii + 1) * len_part]
        train[ii] = [jj for jj in data if jj not in test[ii]]

    return train, test
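A short usage sketch for the 50-50 split from the question (with k=2, each of the two folds holds roughly half of the lines; the file name is the question's):

train, test = k_fold("datafile.txt", k=2)

# pick either of the two folds, e.g. fold 0
train_data = train[0]  # ~50 lines of a 100-line file
test_data = test[0]    # the other ~50 lines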
– Lord Henry Wotton

A quick note on the answer from @subin sahayam:

import random

file = open("datafile.txt", "r")
data = list()
for line in file:
    data.append(line.split())  # split each line on your preferred delimiter
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int(len(data)*.80+1):]     # remaining lines go to the test set

If your list size is an even number, you should not add the 1 in the line below. Instead, check the size of the list first and then determine whether you need to add the 1.

test_data = data[int(len(data)*.80+1):]
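A tiny worked example with an even-sized list of 100 items shows the element that is silently dropped when the 1 is added:

data = list(range(100))                     # even-sized list

train_data = data[:int((len(data)+1)*.80)]  # data[:80] -> 80 items
test_data = data[int(len(data)*.80+1):]     # data[81:] -> 19 items; element 80 is lost

test_data = data[int(len(data)*.80):]       # data[80:] -> 20 items; nothing is lost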

– lee