I have been working with CSV data in my scripts and want to split it into two datasets:

  1. Test Data
  2. Train Data

I want to split the data 85% / 15% and write the results to two CSV files, Test.csv and Train.csv.

I want to do this in base Python, without external modules such as NumPy, SciPy, Pandas or scikit-learn. Can anyone help me randomly sample the data by percentage? The datasets I will be given may contain an arbitrary number of observations. So far I have only read about Pandas and other modules that sample data by percentage, and I have not found a concrete solution for my problem.

I also want to retain the CSV header in both files, because the headers are needed to access the fields of each row in further analysis.

desmond.carros
  • Your post is very broad. Detail what you have already tried. Use question mark to make clear what question you are asking. – Martin Cowie Mar 15 '16 at 11:24
  • @MartinCowie I have only searched the web and have not tried anything so far. I am looking for the logic to create two files, `Test.csv` and `Train.csv`, from a master file `data.csv`, with 85% of the data in `test.csv` and the remaining 15% in `train.csv`. – desmond.carros Mar 15 '16 at 11:45
  • Why do you want 85% data as test data and 15% as training data? Most probably you need 85% data for training and remaining as test data. – Anup Verma May 18 '19 at 07:49

2 Answers

Use random.shuffle to create a random permutation of your dataset and slice it as you wish:

import random
random.shuffle(data)
train = data[:int(len(data)*0.85)]
test = data[len(train):]
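To keep the header out of the shuffle, it can be set aside before splitting. The following is only a minimal in-memory sketch: the helper name `split_rows`, the `seed` parameter and the sample rows are my own illustrative assumptions, not part of the question.

```python
import random


def split_rows(rows, train_fraction=0.85, seed=None):
    """Return (train, test) lists taken from a shuffled copy of rows.

    Hypothetical helper: the name and the seed parameter are assumptions.
    """
    rng = random.Random(seed)   # private generator, so a seed makes the split repeatable
    shuffled = list(rows)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up sample data: the header is kept separate and never shuffled.
header = ["id", "value"]
rows = [[str(i), str(i * i)] for i in range(20)]

train, test = split_rows(rows, train_fraction=0.85, seed=0)
train_table = [header] + train  # header first, then the sampled rows
test_table = [header] + test
```

Every row ends up in exactly one of the two lists, and both tables start with the same header. This still needs the whole dataset in memory.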

Since you asked for a specific solution that partitions a potentially large CSV file into two files for training and test data, I'll also show how that can be done with an approach similar to the general method above:

import random

# Count lines
with open('data.csv','r') as csvf:
    linecount = sum(1 for line in csvf if line.strip() != '')

# Create index sets for training and test data
indices = list(range(linecount))
random.shuffle(indices)
ind_test = set(indices[:int(linecount*0.15)])
del indices

# Partition CSV file
with open('data.csv','r') as csvf, open('train.csv','w') as trainf, open('test.csv','w') as testf:
    i = 0  # index among non-blank lines, matching the count above
    for line in csvf:
        if line.strip() != '':
            if i in ind_test:
                testf.write(line.strip() + '\n')
            else:
                trainf.write(line.strip() + '\n')
            i += 1

Here, I assume that the CSV file contains one observation per line.

This creates an exact 85:15 split. If an approximate split is acceptable, Peter Wood's solution is much more efficient.
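The question also asks for the header to appear in both output files. A possible variant of the two-pass approach that does this is sketched below. It assumes the first line is the header and that no field contains an embedded newline; the function name `exact_split` and its parameters are hypothetical.

```python
import random


def exact_split(src, train_path, test_path, test_fraction=0.15, seed=None):
    """Stream src into two files with an exact split; the header goes to both.

    Hypothetical helper. Assumes line 1 is the header and one observation per line.
    """
    rng = random.Random(seed)

    # First pass: count the non-blank data lines, excluding the header.
    with open(src, 'r') as f:
        next(f, None)  # skip the header line
        n = sum(1 for line in f if line.strip())

    # Pick exactly int(n * test_fraction) distinct row indices for the test set.
    test_idx = set(rng.sample(range(n), int(n * test_fraction)))

    # Second pass: copy the header to both files, then route each data line.
    with open(src, 'r') as f, \
            open(train_path, 'w') as train, \
            open(test_path, 'w') as test:
        header = next(f, '')
        train.write(header)
        test.write(header)
        i = 0  # index among non-blank data lines
        for line in f:
            if not line.strip():
                continue
            (test if i in test_idx else train).write(line)
            i += 1

    return n - len(test_idx), len(test_idx)  # (train rows, test rows)
```

With the file names from the question, the call would be `exact_split('data.csv', 'train.csv', 'test.csv')`. Only the set of test indices is held in memory, not the rows themselves.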

Callidior
  • What do you mean by "CSV data"? You have not mentioned how you store your data in the question, so I just assumed that `data` is a sequence of observations. – Callidior Mar 15 '16 at 11:25
  • I am sorry I did not mention it. My data is in CSV format and I want to sample it accordingly. Thanks anyway. :) – desmond.carros Mar 15 '16 at 11:26
  • @desmond.carros How big are your CSV files? This expects them to all be in memory at once. – Peter Wood Mar 15 '16 at 11:39
  • @PeterWood The CSV files may be gigabytes in size, that is, they may contain a million entries or more. – desmond.carros Mar 15 '16 at 11:41

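Use the random function in the random module to get a uniformly distributed random number in the half-open interval [0, 1).

If it is greater than 0.85, write the line to the training data; otherwise write it to the test data. On average this sends about 15% of the rows to the training file and 85% to the test file, which is the division asked for in the question's comments. See How do I simulate flip of biased coin in python?.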
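As a quick sanity check of the threshold idea, the small simulation below estimates how often a draw exceeds 0.85. The helper `fraction_above` and the seed value are illustrative assumptions only.

```python
import random


def fraction_above(threshold, draws, seed=None):
    """Estimate the probability that random() exceeds threshold (hypothetical helper)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(draws) if rng.random() > threshold)
    return hits / draws


# With many draws the estimate should land close to 0.15, though not exactly on it.
print(fraction_above(0.85, 100_000, seed=1))
```

The same reasoning explains why the resulting split is only approximately 85:15: each row is assigned independently, so the final counts vary from run to run.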

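import random

# File names taken from the question's comments; adjust as needed
input_file = 'data.csv'
test_output = 'test.csv'
train_output = 'train.csv'

with open(input_file) as data: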
    with open(test_output, 'w') as test:
        with open(train_output, 'w') as train:
            header = next(data)
            test.write(header)
            train.write(header)
            for line in data:
                if random.random() > 0.85:
                    train.write(line)
                else:
                    test.write(line)
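If the split has to be repeatable between runs, one option is to draw from a seeded generator and report how many rows went to each file. This is a sketch under my own assumptions: the function name `bernoulli_split`, its `train_fraction` and `seed` parameters, and the returned counts are not part of the code above.

```python
import random


def bernoulli_split(src, train_path, test_path, train_fraction=0.15, seed=None):
    """Single-pass approximate split; the header is written to both files.

    Hypothetical helper. Each data line goes to the training file with
    probability train_fraction, otherwise to the test file.
    """
    rng = random.Random(seed)  # same seed and same input give the same split
    counts = {'train': 0, 'test': 0}
    with open(src) as data, \
            open(train_path, 'w') as train, \
            open(test_path, 'w') as test:
        header = next(data, '')
        train.write(header)
        test.write(header)
        for line in data:
            if rng.random() < train_fraction:
                train.write(line)
                counts['train'] += 1
            else:
                test.write(line)
                counts['test'] += 1
    return counts
```

The file is read once and nothing is kept in memory besides the current line, which suits multi-gigabyte inputs. The returned counts show how far the actual split is from the requested fraction.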
Peter Wood