0

I have an input file, word.txt. I am trying to split the file 75%-25% randomly in Python.

def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    lines[-1] = lines[-1].rstrip('\n') + '\n'
    traingdata = len(lines)* 75 // 100
    testdata = len(lines)-traingdata
    with open(outfilename1, 'w') as f:
        f.writelines(lines[:traingdata])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[:testdata])

But this code writes the first 75% of the original file to the first output file and then the first 25% of the original file to the second output file. Could you please suggest a way to solve this?

Andy
  • answer for same question: http://stackoverflow.com/questions/17412439/how-to-split-data-into-trainset-and-testset-randomly – rebeling Sep 20 '15 at 20:15

4 Answers

2

If you don't want to read the whole file into memory, I would use something like this. Note that it also supports splitting without shuffling:

import random

def split_file(file,out1,out2,percentage=0.75,isShuffle=True,seed=123):
    """Splits a file in 2 given the `percentage` to go in the large file."""
    random.seed(seed)
    with open(file, 'r',encoding="utf-8") as fin, \
         open(out1, 'w',encoding="utf-8") as foutBig, \
         open(out2, 'w',encoding="utf-8") as foutSmall:

        nLines = sum(1 for line in fin)  # count first; without it the percentage could only be approximated
        fin.seek(0)
        nTrain = int(nLines*percentage)
        nValid = nLines - nTrain

        nBig = 0  # lines written to the large file so far
        for i, line in enumerate(fin):
            r = random.random() if isShuffle else 0  # 0 always passes the test below when not isShuffle
            # i - nBig lines have gone to the small file; once it is full, the rest must go to the large one
            smallFull = i - nBig >= nValid
            if nBig < nTrain and (r < percentage or smallFull):
                foutBig.write(line)
                nBig += 1
            else:
                foutSmall.write(line)
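
If both an exact 75/25 count and an unbiased choice of lines matter, one alternative to the loop above is selection sampling: take each line with probability (lines still needed) / (lines still unread). The sketch below is not the answer's code; `split_file_exact` and its parameter names are hypothetical, and it assumes a UTF-8 text file that fits a two-pass read.

```python
import random

def split_file_exact(src, out_big, out_small, fraction=0.75, seed=123):
    """Stream `src` into two files; exactly int(n * fraction) lines go to `out_big`."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    with open(src, "r", encoding="utf-8") as fin:
        n_lines = sum(1 for _ in fin)  # first pass: count the lines
    needed = int(n_lines * fraction)  # lines still owed to the large file
    with open(src, "r", encoding="utf-8") as fin, \
         open(out_big, "w", encoding="utf-8") as f_big, \
         open(out_small, "w", encoding="utf-8") as f_small:
        for i, line in enumerate(fin):
            remaining = n_lines - i  # unread lines, including this one
            # Selection sampling: take this line with probability needed / remaining.
            if rng.random() * remaining < needed:
                f_big.write(line)
                needed -= 1
            else:
                f_small.write(line)
```

Each output keeps the lines in their original relative order; only the assignment of lines to files is random.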

If your file is so big that you don't want to iterate over it twice (once for counting), you can split it probabilistically. With that many lines, each output file ends up close to the requested share:

import random

def split_huge_file(file,out1,out2,percentage=0.75,seed=123):
    """Splits a file in 2 given the approximate `percentage` to go in the large file."""
    random.seed(seed)
    with open(file, 'r',encoding="utf-8") as fin, \
         open(out1, 'w',encoding="utf-8") as foutBig, \
         open(out2, 'w',encoding="utf-8") as foutSmall:

        for line in fin:
            r = random.random() 
            if r < percentage:
                foutBig.write(line)
            else:
                foutSmall.write(line)
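
To get a feel for how close the probabilistic split gets, note that the number of lines sent to the large file is binomial, so the share's typical deviation is sqrt(p * (1 - p) / n): roughly 0.4 percentage points for 10,000 lines at p = 0.75. The sketch below simulates only the `r < percentage` test, without touching any files; `simulated_share` is a hypothetical helper, not part of the answer.

```python
import random

def simulated_share(n_lines, percentage=0.75, seed=123):
    """Fraction of n_lines that the `r < percentage` test sends to the large file."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    n_big = sum(1 for _ in range(n_lines) if rng.random() < percentage)
    return n_big / n_lines

# The share should drift toward 0.75 as the line count grows.
for n in (100, 10_000, 1_000_000):
    print(n, simulated_share(n))
```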
Yann Dubois
1

The problem is that in this line

 f.writelines(lines[:testdata])

you are saying "everything from index 0 to index testdata":

 f.writelines(lines[0:testdata])

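which is not what you want. Just change it to

 f.writelines(lines[traingdata:])

which means "everything from index traingdata to the end of the list". That should work. A slice excludes its end index, so lines[:traingdata] stops at index traingdata - 1 and lines[traingdata:] picks up at index traingdata; together the two slices cover every line exactly once. Starting at traingdata + 1 instead would skip one line.

Note also that the function imports shuffle but never calls it, so add shuffle(lines) before slicing if the split should be random.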
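
Putting the pieces together, here is a minimal sketch of the question's function with the shuffle call added and both slices taken at the same index. The `ratio` parameter and the empty-file guard are additions for illustration and are not in the original code.

```python
from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2, ratio=0.75):
    """Write a random `ratio` of the lines to outfilename1 and the rest to outfilename2."""
    with open(infilename, 'r') as f:
        lines = f.readlines()

    if lines:
        # append a newline in case the last line didn't end with one
        lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)  # randomize the order before slicing
    split = int(len(lines) * ratio)  # index where the second part starts

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:split])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[split:])
```

Calling `shuffle_split('word.txt', 'train.txt', 'test.txt')` would then produce the two files; the output file names here are placeholders.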

0

Shuffle your lines first:

shuffle(lines)

Then you just need a bit of list slicing to get your two sets:

from random import shuffle
TRAINING_RATIO = 0.75    # This is the fraction of the list you want to be training data

...

shuffle(lines)
split = int(len(lines)*TRAINING_RATIO)    # index where the test data starts
train, test = lines[:split], lines[split:]

At the end of this, you will have two lists, train and test. train will contain 75% of your data (rounded down to a whole number of lines). test will contain the rest.

This is done via the following (for train):

lines[:split]

This takes everything from the beginning of the shuffled list up to the 75% mark. For test, it gets the remaining 25%:

lines[split:]

Both slices use the same index, so every line lands in exactly one list. Rounding the two boundaries separately (math.floor for train, math.ceil for test) would drop a line whenever len(lines)*TRAINING_RATIO is not a whole number.

For example, using a file with the numbers 1-20, each on its own line (20 lines total), and with the trailing \n stripped:

Train: ['2', '17', '19', '6', '5', '3', '14', '7', '10', '18', '9', '20', '16', '4', '8']
Test: ['12', '15', '13', '1', '11']
Andy
0

This shuffles the lines after reading them, then saves the two parts to separate files:

outfilename1 = "lines25.txt"
outfilename2 = "lines75.txt"
import random

with open('w2.txt','r') as f:
    lines = f.readlines()

random.shuffle(lines)
numlines = int(len(lines)*0.25)

with open(outfilename1, 'w') as f:
    f.writelines(lines[:numlines])
with open(outfilename2, 'w') as f:
    f.writelines(lines[numlines:])
Pynchia